John Darrington <[EMAIL PROTECTED]> writes:

> On Thu, Jun 01, 2006 at 07:52:45PM -0700, Ben Pfaff wrote:
> John Darrington <[EMAIL PROTECTED]> writes:
> > - Need casefile bookmarking/cloning. Should be easy; I'll take
> > care of it soon.
> >
> > It will also enable (I hope) integration of casefiles into the GUI.
> > There's a fundamental problem here: the GUI needs to scroll through
> > the cases. Scrolling forwards is not a problem. Scrolling backwards
> > is something that casefiles don't allow. My idea is to cache a small
> > number of cases (about twice the number that fits in the window), which
> > allows for small-magnitude scrolling. Scrolling back through larger
> > offsets is where the bookmark would come into play.
>
> But I can easily add random access for the GUI to use. That's a
> situation where it makes perfect sense to support random access.
>
> How would random access cope if somebody uses the GUI to open a system
> file with a *HUGE* number of cases? It would have to swap back and
> forth to disk. Perhaps we should discuss this in a separate thread.
OK. New thread.

If you're using a casefile, then it's backed either by an array of
cases in memory or by a disk file in the casefile format. Random
access in an array is trivial. A disk file can be read sequentially or
randomly. We currently do only sequential access, and adding random
access wouldn't change that: procedures would still read the casefile
sequentially.

Scrolling through a casefile in a GUI requires only a low rate of disk
access. If your cases have 1,000 variables each, then that's 8 kB per
case. If you display 100 of those cases at a time, then that's 800 kB
of data. Seeking to a location on disk and reading 800 kB might take
0.1 second on a modern desktop machine, and it's going to take a user
a lot longer than that to look over the data. (Sometimes users just
hold down PgUp or PgDn to skip past lots of data, but you don't really
have to display all of it, because they can't read that fast anyway;
you might as well skip reading or displaying some pages if the disk
can't keep up.) In short, I think that random access for interactive
use is fine.

Now suppose the opposite approach. You set a bookmark, say, 1,000
cases back from the current position. When the user hits PgUp a couple
of times (say, skipping past 200 cases, which is past your small
number of cached cases), you have to jump back to the bookmark and
read forward sequentially. What's going to happen inside the casefile?
If the casefile is in memory, it reads cases from memory, and the
situation is uninteresting. If it's on disk, it seeks back in the
casefile and reads forward. In other words, it does exactly the same
thing as random access, except that it has to read all the cases in
between as well: instead of reading the 100 cases that start 300 cases
back, it reads 800 cases starting 1,000 cases back. There's no benefit
in that; you might as well just read the 100 cases you actually want.

There's another issue here. All this assumes that your data is in a
casefile, but you're really talking about a system file, which is a
different beast. To do what I'm describing above, you'd have to copy
the system file's data into a casefile, which would double the disk
space needed (the original system file plus a copy in a disk-based
casefile). What you might really want is to operate directly on the
system file's content. The system file interface doesn't support
random access, but it could, just as the casefile interface could. At
least, it could easily for non-compressed system files; supporting
random access in compressed system files would take extra time or
extra space, because there's no way to know where to seek to.

I'll conclude with a final, forward-looking issue. Currently, every
procedure reads the active file from somewhere, such as a system file
or casefile, transforms it, and then writes the transformed version to
a casefile[*]. That is, it always makes a new copy (and, if the old
version is a casefile, throws away the old copy). (In SPSS syntax,
this is equivalent to always running the CACHE utility.) But, as
you've pointed out before, this is wasteful; it is usually[+] possible
to avoid writing a new copy, if you just retain the old
transformations and re-apply them to the old data, followed by any new
transformations, on the next procedure. I'm planning to implement this
relatively soon.
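Going back to random access for a moment, here is roughly the shape I
have in mind, just to make the arithmetic above concrete. It's only a
sketch, not the existing casefile interface: it assumes an
uncompressed, disk-backed casefile stored as a flat array of 8-byte
values, and read_cases_at is a made-up name.

#include <stdio.h>
#include <sys/types.h>

/* Reads CASE_CNT cases, starting with case number CASE_IDX, from an
   uncompressed disk-backed casefile laid out as a flat array of
   8-byte values, VALUE_CNT values per case.  BUF must have room for
   CASE_CNT * VALUE_CNT doubles.  Returns the number of cases read,
   or -1 if the seek fails. */
static long
read_cases_at (FILE *file, size_t value_cnt,
               unsigned long case_idx, size_t case_cnt, double *buf)
{
  size_t case_size = value_cnt * sizeof *buf;   /* 1,000 values -> 8 kB. */
  off_t offset = (off_t) case_idx * case_size;

  if (fseeko (file, offset, SEEK_SET) != 0)
    return -1;

  /* One seek plus one sequential read: for 100 cases of 1,000 values
     each, that's the ~800 kB transfer estimated above, no matter how
     far away the target cases are. */
  return fread (buf, case_size, case_cnt, file);
}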
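And here is roughly the shape that retaining and re-applying
transformations might take. Again, everything here (struct
active_file, source_read, and so on) is a made-up illustration rather
than a current interface, and it glosses over transformations that
filter or reorder cases.

#include <stdbool.h>
#include <stddef.h>

struct ccase;                        /* One case's data. */
struct source;                       /* System file, casefile, ... */

/* Hypothetical reader for the untransformed source data. */
bool source_read (struct source *, struct ccase *);

/* One retained, not-yet-applied transformation (COMPUTE, RECODE, ...). */
typedef void transform_func (struct ccase *, void *aux);
struct pending_transform
  {
    transform_func *apply;
    void *aux;
  };

/* The active file: the original data plus the transformations that
   have not been materialized yet. */
struct active_file
  {
    struct source *src;
    struct pending_transform *trns;
    size_t trns_cnt;
  };

/* Reads the next case into C, applying every pending transformation
   on the fly, so that no transformed copy ever hits the disk.
   Returns false at end of data. */
static bool
active_file_read (struct active_file *af, struct ccase *c)
{
  size_t i;

  if (!source_read (af->src, c))
    return false;
  for (i = 0; i < af->trns_cnt; i++)
    af->trns[i].apply (c, af->trns[i].aux);
  return true;
}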
But after that, there's no obvious place for the data viewer window to
get its data from, because the data it wants to show is not actually
stored anywhere; it's just defined in terms of a source file plus a
bunch of transformations. I'm not sure what we'll want to do about
that; one option would be to always write out a new copy when the GUI
is in use.

[*] A few unusual procedures (e.g. FLIP) don't write their output to a
    casefile.

[+] Some procedures' effects cannot be represented as a
    transformation, e.g. SORT, AGGREGATE.

-- 
"[Modern] war is waged by each ruling group against its own subjects,
 and the object of the war is not to make or prevent conquests of
 territory, but to keep the structure of society intact."
--George Orwell, _1984_
