John Darrington <[EMAIL PROTECTED]> writes:

> On Thu, Jun 01, 2006 at 07:52:45PM -0700, Ben Pfaff wrote:
> John Darrington <[EMAIL PROTECTED]> writes:
> > - Need casefile bookmarking/cloning. Should be easy; I'll take
> > care of it soon.
> >
> > It will also enable (I hope) integration of casefiles into the GUI.
> > There's a fundamental problem here: the GUI needs to scroll through
> > the cases. Scrolling forwards is not a problem. Scrolling backwards
> > is something that casefiles don't allow. My idea is to cache a small
> > number of cases (about twice the number that fits in the window), which
> > allows for small-magnitude scrolling. Scrolling back through larger
> > offsets is where the bookmark would come into play.
>
> But I can easily add random access for the GUI to use. That's a
> situation where it makes perfect sense to support random access.
>
> How would random access cope if somebody uses the GUI to open a system
> file with a *HUGE* number of cases? It would have to swap back and
> forth to disk. Perhaps we should discuss this in a separate thread.
OK. New thread.

If you're using a casefile, then it's backed either by an array of
cases in memory or by a disk file in the casefile format. Random
access in an array is trivial. A disk file can be read sequentially or
randomly. We currently do only sequential access, and adding random
access wouldn't change that: procedures would still read the casefile
sequentially.

Scrolling through a casefile in a GUI requires only a low rate of disk
access. If your cases have 1,000 variables each, then that's 8 kB per
case. If you display 100 of those cases at a time, then that's 800 kB
of data. Seeking to a location on disk and reading 800 kB might take
0.1 second on a modern desktop machine, and it's going to take a user
a lot longer than that to look over the data. (Sometimes users just
hold down PgUp or PgDn to skip past lots of data, but you don't really
have to display all of it, because they can't read that fast anyway;
you might as well skip reading or displaying some pages if the disk
can't keep up.) In short, I think that random access for interactive
use is fine.

Now suppose the opposite approach. You set a bookmark, say, 1,000
cases back from the current position. When the user hits PgUp a couple
of times (say, skipping past 200 cases, which is past your small
number of cached cases), you have to jump back to the bookmark and
read forward sequentially. What's going to happen inside the casefile?
If the casefile is in memory, it reads cases from memory, and the
situation is uninteresting. If it's on disk, it seeks back in the
casefile and reads forward. In other words, it does exactly the same
thing as random access, except that it has to read all the cases in
between as well: instead of reading the 100 cases that start 300 cases
back, it reads 800 cases starting 1,000 cases back. There's no benefit
in that; you might as well just read the 100 cases you actually want.

There's another issue here. All this assumes that your data is in a
casefile, but you're really talking about a system file, which is a
different beast. To do what I'm describing above, you'd have to copy
the system file's data into a casefile, which would double the disk
space needed (the original system file plus a copy in a disk-based
casefile). What you might really want is to operate directly on the
system file's content. The system file interface doesn't support
random access, but it could, just as the casefile interface could. At
least, it could easily for non-compressed system files; supporting
random access in compressed system files would take extra time or
extra space, because there's no way to know where to seek to.

I'll conclude with a final, forward-looking issue. Currently, every
procedure reads the active file from somewhere, such as a system file
or casefile, transforms it, and then writes the transformed version to
a casefile[*]. That is, it always makes a new copy (and, if the old
version is a casefile, throws away the old copy). (In SPSS syntax,
this is equivalent to always running the CACHE utility.) But, as
you've pointed out before, this is wasteful; it is usually[+] possible
to avoid writing a new copy, if you just retain the old
transformations and re-apply them to the old data, followed by any new
transformations, on the next procedure. I'm planning to implement this
relatively soon.
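Going back to random access for a moment, here is roughly the shape I
have in mind, just to make the arithmetic above concrete. It's only a
sketch, not the existing casefile interface: it assumes an
uncompressed, disk-backed casefile stored as a flat array of 8-byte
values, and read_cases_at is a made-up name.

#include <stdio.h>
#include <sys/types.h>

/* Reads CASE_CNT cases, starting with case number CASE_IDX, from an
   uncompressed disk-backed casefile laid out as a flat array of
   8-byte values, VALUE_CNT values per case.  BUF must have room for
   CASE_CNT * VALUE_CNT doubles.  Returns the number of cases read,
   or -1 if the seek fails. */
static long
read_cases_at (FILE *file, size_t value_cnt,
               unsigned long case_idx, size_t case_cnt, double *buf)
{
  size_t case_size = value_cnt * sizeof *buf;   /* 1,000 values -> 8 kB. */
  off_t offset = (off_t) case_idx * case_size;

  if (fseeko (file, offset, SEEK_SET) != 0)
    return -1;

  /* One seek plus one sequential read: for 100 cases of 1,000 values
     each, that's the ~800 kB transfer estimated above, no matter how
     far away the target cases are. */
  return fread (buf, case_size, case_cnt, file);
}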
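And here is roughly the shape that retaining and re-applying
transformations might take. Again, everything here (struct
active_file, source_read, and so on) is a made-up illustration rather
than a current interface, and it glosses over transformations that
filter or reorder cases.

#include <stdbool.h>
#include <stddef.h>

struct ccase;                        /* One case's data. */
struct source;                       /* System file, casefile, ... */

/* Hypothetical reader for the untransformed source data. */
bool source_read (struct source *, struct ccase *);

/* One retained, not-yet-applied transformation (COMPUTE, RECODE, ...). */
typedef void transform_func (struct ccase *, void *aux);
struct pending_transform
  {
    transform_func *apply;
    void *aux;
  };

/* The active file: the original data plus the transformations that
   have not been materialized yet. */
struct active_file
  {
    struct source *src;
    struct pending_transform *trns;
    size_t trns_cnt;
  };

/* Reads the next case into C, applying every pending transformation
   on the fly, so that no transformed copy ever hits the disk.
   Returns false at end of data. */
static bool
active_file_read (struct active_file *af, struct ccase *c)
{
  size_t i;

  if (!source_read (af->src, c))
    return false;
  for (i = 0; i < af->trns_cnt; i++)
    af->trns[i].apply (c, af->trns[i].aux);
  return true;
}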
But after that, there's no obvious place for the data viewer window to
get its data from, because the data it wants to show is not actually
stored anywhere; it's just defined in terms of a source file plus a
bunch of transformations. I'm not sure what we'll want to do about
that; one option would be to always write out a new copy when the GUI
is in use.

[*] A few unusual procedures (e.g. FLIP) don't write their output to a
    casefile.

[+] Some procedures' effects cannot be represented as a
    transformation, e.g. SORT, AGGREGATE.

-- 
"[Modern] war is waged by each ruling group against its own subjects,
 and the object of the war is not to make or prevent conquests of
 territory, but to keep the structure of society intact."
--George Orwell, _1984_
