Hi Rick, et al,

Has a diff of this code been sent out (to properly track file I/O on
checkpoint restore)?  I think I am running into the same problem on some
benchmarks.  I can write the code if necessary, but if it's already been
done... :).

Thanks,

    -Vilas

On Nov 15, 2007 6:54 PM, Ali Saidi <[EMAIL PROTECTED]> wrote:

> Yes please send us a diff when you're done and have tested the code.
>
> Thanks,
> Ali
>
> On Nov 15, 2007, at 6:51 PM, Rick Strong wrote:
>
> > Your collective trust of m5 guru-ness was right. I am now saving
> > file offset, host flags, mode, name and able to bring it back from
> > checkpoint. Most of my changes were localized to the alloc_fd and
> > free_fd function in process.c by including a few more parameters. I
> > just have to correctly handle pipe, and I will post a diff if any
> > are interested. I went with the parallel array technique to just
> > avoid too many changes, and that may reduce desirability.  Let me
> > know.
> >
> > -R
> >
> > Steve Reinhardt wrote:
> >> Hmm, yea, I think making this work robustly in all cases is
> >> non-trivial, but you can probably at least fix your bug pretty
> >> easily.
> >>
> >> Basically the fd_map array in Process is the key; all the non-
> >> negative
> >> entries in this array are open file descriptors in the target
> >> process,
> >> and the value of the entry is the file descriptor in m5 that it
> >> corresponds to.
> >>
> >> Rather than saving and restoring this array literally, like we are
> >> now
> >> (which actually makes no sense), you should serialize for every
> >> non-negative fd the filename, mode (ro, rw) and offset, then reopen
> >> the file, seek to that offset, and store the new fd in the array
> >> entry
> >> in the unserialize method.
> >>
> >> To have the filename and mode around you'll need to save that in a
> >> parallel array (or better expand fd_map to a struct) on the "open"
> >> calls.  You'll have to special-case stdin/stdout etc if they don't
> >> get
> >> reassigned.
> >>
> >> To be really thorough you'll have to handle dup'd fds, etc. specially
> >> too, but that's probably optional in terms of getting past your bug.
> >>
> >> Hope that helps...
> >>
> >> Steve
> >>
> >> On Nov 14, 2007 9:38 PM, Rick Strong <[EMAIL PROTECTED]> wrote:
> >>
> >>> All right, I am in the process on understanding how it all works.
> >>> Where
> >>> is a good place to start. I am right now looking through sim/
> >>> process.*
> >>> and sim/syscall_emul* to work backwards to where all the
> >>> information is
> >>> stored. If someone has insight on this system and could offer a
> >>> brief
> >>> description of how it works, it would be very helpful.
> >>>
> >>> -Richard
> >>>
> >>>
> >>> Nathan Binkert wrote:
> >>>
> >>>> When you fix this, pretty please submit a diff :)
> >>>>
> >>>>
> >>>>> I'm pretty sure I figured it out and I'm pretty sure it is
> >>>>> related to
> >>>>> file I/O. When we restore from a checkpoint we don't reopen and
> >>>>> seek
> >>>>> to the appropriate place in any files we were reading from/writing
> >>>>> to. I bet what is happening is that the benchmark attempts to read
> >>>>> some input data (or maybe write some data) and the file
> >>>>> descriptor is
> >>>>> invalid when M5 passes the syscall through to the host OS. The OS
> >>>>> returns an error code which alters the path of the benchmark and
> >>>>> it
> >>>>> exits early. It shouldn't be too hard to fix but I don't have
> >>>>> time to
> >>>>> do it at the moment. You would need to keep track of all the open
> >>>>> files paths and modes and add the paths/modes to the checkpoint
> >>>>> along
> >>>>> with the current position (via tell()). Upon restoring from a
> >>>>> checkpoint you would reopen the files and seek() to the
> >>>>> appropriate
> >>>>> place in the file.
> >>>>>
> >>>>> Ali
> >>>>>
> >>>>> On Nov 14, 2007, at 10:02 PM, Rick Strong wrote:
> >>>>>
> >>>>>
> >>>>>> When I take a checkpoint in AtomicSimpleCPU (m5_2.0b4) at
> >>>>>> curTick=100015476500 (approx. 200,000,000 insts into the
> >>>>>> binary) in
> >>>>>> mcf, and resume execution in any CPU model, I get an exit syscall
> >>>>>> (syscall trace included below) at cycle 100522711000 (approx
> >>>>>> 1014345
> >>>>>> insts into execution). What is strange is that if I run
> >>>>>> AtomicSimpleCPU through this point (from start), I have no
> >>>>>> problems.
> >>>>>> Any ideas on either the problem or how to debug?
> >>>>>>
> >>>>>> It turns out that the same problem happens for checkpoints in
> >>>>>> twolf
> >>>>>> about 200,000,000 insts into the binary. A resume has some file
> >>>>>> i/o
> >>>>>> and an untimely exit. Both problems seem related to file i/o and
> >>>>>> then an exit call. Is it possible that some system call is not
> >>>>>> implemented and defaulting to exit. I included the syscall
> >>>>>> trace for
> >>>>>> twolf for any interested parties:
> >>>>>>
> >>>>>> I have resumed both checkpoints, immediately created new
> >>>>>> checkpoints, and they diff clean (except for order of the ptable
> >>>>>> entries).
> >>>>>>
> >>>>>> I am right now working on getting an EXEC trace for mcf, one from
> >>>>>> checkpoint and one executing from the beginning to find any
> >>>>>> differences.
> >>>>>>
> >>>>>>
> >>>>>> TWOLF syscall trace
> >>>>>> "
> >>>>>> 100285445500: system.cpu: pc 4832275812 syscall read called
> >>>>>> w/arguments 4,5368834056,8192,1
> >>>>>> 100285445500: system.cpu: syscall read returns
> >>>>>> 18446744073709551615
> >>>>>> 100286500500: system.cpu: pc 4832275812 syscall read called
> >>>>>> w/arguments 4,5368834056,8192,5
> >>>>>> 100286500500: system.cpu: syscall read returns
> >>>>>> 18446744073709551615
> >>>>>> 100287514000: system.cpu: pc 4832260836 syscall close called
> >>>>>> w/arguments 0,4831383888,1,1048576
> >>>>>> 100287514000: system.cpu: syscall close returns 0
> >>>>>> 100287679500: system.cpu: pc 4832260628 syscall write called
> >>>>>> w/arguments 1,5368796680,172,1048576
> >>>>>>
> >>>>>> TimberWolfSC version:v4.3a date:Mon Jan 25 18:50:36 EST 1988
> >>>>>> Standard Cell Placement and Global Routing Program
> >>>>>> Authors: Carl Sechen, Bill Swartz
> >>>>>>     Yale University
> >>>>>> 100287679500: system.cpu: syscall write returns 172
> >>>>>> 100287726500: system.cpu: pc 4832260836 syscall close called
> >>>>>> w/arguments 1,4831383888,172,0
> >>>>>>
> >>>>>> " MCF SYSCALL TRACE "
> >>>>>>
> >>>>>>>> 100519102000: system.cpu: syscall read called w/arguments
> >>>>>>>> 3,5368799240,8192,7
> >>>>>>>> 100519102000: system.cpu: syscall read returns
> >>>>>>>> 18446744073709551615
> >>>>>>>> 100521401500: system.cpu: syscall obreak called w/arguments
> >>>>>>>> 5374902272,0,0,1048576
> >>>>>>>> 100521401500: global: Break Point changed to: 0X1405E8000
> >>>>>>>> 100521401500: system.cpu: syscall obreak returns 5374902272
> >>>>>>>> 100521680500: system.cpu: syscall close called w/arguments
> >>>>>>>> 0,4831387472,1,1048576
> >>>>>>>> 100521680500: system.cpu: syscall close returns 0
> >>>>>>>> 100521846000: system.cpu: syscall write called w/arguments
> >>>>>>>> 1,5368778616,119,1048576
> >>>>>>>> 100521846000: system.cpu: syscall write returns 119
> >>>>>>>> 100521893000: system.cpu: syscall close called w/arguments
> >>>>>>>> 1,4831387472,119,0
> >>>>>>>> 100521893000: system.cpu: syscall close returns 0
> >>>>>>>> 100522014000: system.cpu: syscall close called w/arguments
> >>>>>>>> 2,4831387472,0,1048576
> >>>>>>>> 100522014000: system.cpu: syscall close returns
> >>>>>>>> 18446744073709551615
> >>>>>>>> 100522187500: system.cpu: syscall close called w/arguments
> >>>>>>>> 3,4831387472,1,1048576
> >>>>>>>> 100522187500: system.cpu: syscall close returns 0
> >>>>>>>> 100522357000: system.cpu: syscall obreak called w/arguments
> >>>>>>>> 5368815616,0,0,1048576
> >>>>>>>> 100522357000: global: Break Point changed to: 0X14001A000
> >>>>>>>> 100522357000: system.cpu: syscall obreak returns 5368815616
> >>>>>>>> 100522623500: system.cpu: syscall sigprocmask called w/
> >>>>>>>> arguments
> >>>>>>>> 1,18446744073709547831,0,0
> >>>>>>>> warn: ignoring syscall sigprocmask(1,
> >>>>>>>> 18446744073709547831, ...)
> >>>>>>>> 100522623500: system.cpu: syscall sigprocmask returns 0
> >>>>>>>> 100522711000: system.cpu: syscall exit called w/arguments
> >>>>>>>> 18446744073709551615,5368739848,2,0
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>>>>> m5-users mailing list
> >>>>>>>> m5-users@m5sim.org
> >>>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
> >>>>>>>>
> >>>>>>>>
> >>>>>> _______________________________________________
> >>>>>> m5-users mailing list
> >>>>>> m5-users@m5sim.org
> >>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
> >>>>>>
> >>>>>>
> >>>>> _______________________________________________
> >>>>> m5-users mailing list
> >>>>> m5-users@m5sim.org
> >>>>> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
> >>>>>
> >>> _______________________________________________
> >>> m5-users mailing list
> >>> m5-users@m5sim.org
> >>> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
> >>>
> >>>
> >>
> >>
> >
> > _______________________________________________
> > m5-users mailing list
> > m5-users@m5sim.org
> > http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
> >
>
> _______________________________________________
> m5-users mailing list
> m5-users@m5sim.org
> http://m5sim.org/cgi-bin/mailman/listinfo/m5-users
>
_______________________________________________
m5-users mailing list
m5-users@m5sim.org
http://m5sim.org/cgi-bin/mailman/listinfo/m5-users

Reply via email to