Re: File.seek() interface

Larry Wall Thu, 07 Jul 2005 17:59:09 -0700

On Thu, Jul 07, 2005 at 02:15:19PM -0600, Paul Seamons wrote:
: > We should approach this from the perspective that $fh is an iterator, so
: >   the general problem is "how do we navigate a random-access iterator?".
: 
: Well - I kind of thought that $fh was a filehandle that knew how to behave 
: like an iterator if asked to do so.


Yes, basically.  And they fall into that class of iterators that may
or may not know how to back up, so it may be quite possible to seek forward
10 items but not backward 10 items, if "item" is, for example, a line
defined by an asymmetric match rule.

: There are too many applications that 
: need to jump around using seek.

We need to have a POSIXly correct layer, but that's no reason not to have
other layers on top of that with more useful semantics.  I view files
as just funny-looking strings, in the abstract.  So the same issues
arise that we've talked about concerning strings in Unicode, and that's
even before we get into counting lines or paragraphs.  Like a string,
a file may naturally allow itself to be viewed as bytes (POSIX), codepoints,
graphemes, and/or characters in the current language.  It can allow
multiple views into the same abstract string, but as with strings,
it may limit the minimum and maximum abstraction level at which you're
allowed to deal with the file.  And depending on the file/string representation,
one of the abstraction levels is likely to be very efficient to seek
around in, and others have to be emulated by visiting all the intermediate
items.  Some file structures are great at indexing into lines but lousy
at indexing into anything smaller than that.  A file position in such
a file is not even going to be an integer, but a line number plus an
offset into the line.
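
As a rough illustration of that last point, here is a minimal Python sketch
(not Perl 6) of a line-oriented store.  The names LinePos, LineStore, at_line
and at_codepoint are made up for this example and are not part of any proposed
API.  Indexing by line is a direct lookup, while indexing by codepoint has to
be emulated by walking the lines in between, and the resulting position is a
(line, offset) pair rather than an integer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinePos:
    """Hypothetical non-integer position: a line number plus an offset."""
    line: int    # index of the line
    offset: int  # codepoint offset inside that line


class LineStore:
    """Hypothetical record-oriented file that is cheap to index by line."""

    def __init__(self, lines):
        self.lines = list(lines)

    def at_line(self, n: int) -> LinePos:
        # Natural unit of this store: a direct index, no scanning.
        if not 0 <= n <= len(self.lines):
            raise IndexError("no such line")
        return LinePos(n, 0)

    def at_codepoint(self, n: int) -> LinePos:
        # Smaller unit than the natural one: emulated by visiting
        # every intermediate line.
        if n < 0:
            raise IndexError("negative codepoint index")
        remaining = n
        for i, text in enumerate(self.lines):
            if remaining < len(text):
                return LinePos(i, remaining)
            remaining -= len(text)
        if remaining == 0:
            return LinePos(len(self.lines), 0)  # one past the last line
        raise IndexError("past the end of the store")


store = LineStore(["ab\n", "cde\n", "f\n"])
print(store.at_line(1))        # LinePos(line=1, offset=0)
print(store.at_codepoint(5))   # LinePos(line=1, offset=2)
```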

I realize most of us come from the POSIXly-correct worldview
that all files are really just a sequence of bytes that can always be
indexed by integer.  This view doesn't make a lot of sense any more
in the world of Unicode.  We see various versions of Unix/Linux being
caught with their pants down because there's no metadata to tell you
the character encoding of the filenames, for instance.  Perl 6 must
not fall into that trap.

In the discussion of seek(), this primarily means that you must keep
reminding yourself that file positions (and string positions) are
not necessarily numbers.  Treat them as opaque recipes for navigating
into a file, because you don't know what the most efficient underlying
representation is.  It might even be some kind of URI.

At the same time, all relative navigation *must* specify the units.
We can't simply assume bytes any more.  And if you specify navigation
in a smaller unit than the natural unit of the file/string in question,
you have to either give it a round-up or round-down instruction, or
be prepared to handle an exception of some sort.  A UTF-8 handler has
the nice property that it can tell if it has landed in the middle of
a character, but it can't read your mind about what to do when that happens.
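
The round-up/round-down choice can be sketched in Python (not Perl 6) as
follows.  This is only an illustration under stated assumptions: the function
name align_utf8 and its mode argument are invented here, and the data is
assumed to be valid UTF-8.  It relies on the fact that UTF-8 continuation
bytes always have the bit pattern 10xxxxxx, which is how a handler can tell
that it has landed inside a character.

```python
def align_utf8(data: bytes, offset: int, mode: str = "error") -> int:
    """Return a byte offset that sits on a UTF-8 character boundary.

    Hypothetical helper.  mode is "down" (round toward the start),
    "up" (round toward the end), or "error" (raise if misaligned).
    Assumes data is valid UTF-8.
    """
    if mode not in ("down", "up", "error"):
        raise ValueError("mode must be 'down', 'up', or 'error'")
    if not 0 <= offset <= len(data):
        raise ValueError("offset outside the data")

    def is_continuation(i: int) -> bool:
        # Continuation bytes look like 0b10xxxxxx.
        return 0 <= i < len(data) and (data[i] & 0xC0) == 0x80

    if not is_continuation(offset):
        return offset  # already on a boundary, or at the very end
    if mode == "error":
        raise ValueError(f"offset {offset} is in the middle of a character")
    step = -1 if mode == "down" else 1
    while is_continuation(offset):
        offset += step
    return offset


data = "a\u00e9\u20acb".encode("utf-8")  # 1-, 2-, 3- and 1-byte characters
print(align_utf8(data, 4, "down"))  # 3
print(align_utf8(data, 4, "up"))    # 6
```

With the default mode the caller gets an exception instead of a guess, which
matches the "be prepared to handle an exception" alternative above.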

: The options that need to be there are:
:    seek from the beginning
:    seek from the end
:    seek from the current location
: 
: Now it could be simplified a bit to the following cases:
: 
:   $fh.seek(10);  # from the beginning forward 10
:   $fh.seek(-10); # from the end backwards 10

Apart from the units and alignment problem, does $fh.seek(-0) mean
the beginning or the end of the file?
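
The ambiguity is easy to reproduce in any language whose integers have no
negative zero.  The Python sketch below (not Perl 6) uses a made-up function,
resolve_signed_seek, that applies the sign-based convention from the quoted
proposal; it is an illustration only.

```python
def resolve_signed_seek(offset: int, size: int) -> int:
    """Map a signed offset to an absolute one.

    Hypothetical helper: offsets >= 0 count from the start, offsets < 0
    count back from the end, as in the quoted proposal.
    """
    return offset if offset >= 0 else size + offset


size = 100
print(resolve_signed_seek(10, size))   # 10
print(resolve_signed_seek(-10, size))  # 90
print(resolve_signed_seek(-0, size))   # 0, the beginning and not the end
```

Since -0 and 0 are the same integer, "zero units back from the end" cannot be
expressed under this convention.  Python's own slicing has the same wart:
"abc"[-0:] is the whole string, not the empty tail.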

:   $fh.seek(10, :relative); # from the current location forward 10
:   $fh.seek(-10, :relative); # from the current location backward 10

Again, 10 whats?  Bytes?  Codepoints?  Lines?

I think I'd actually like to divorce the notion of going to a
particular position from the notion of relative navigation.  So I'm
in favor of $fh.seek taking *only* an opaque position, and $fh.beg
and $fh.cur and $fh.end returning opaque positions.  Then there are
navigation commands that can take an opaque position and move relative
to them a given number of units, and we force the units to be specified.
Something like:

    $fh.pos = $fh.pos + 10`lines

Arguably, we could probably admit

    $fh.pos = 10`bytes

for the case of seeking from the beginning.  But I'd kind of like

    $fh.pos = 10

to be considered an error.
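
To show how these three cases could behave, here is a minimal Python sketch
(not Perl 6, and not a proposed implementation).  Every name in it (Distance,
Pos, Handle, lines, bytes_, read_rest) is hypothetical, and it only moves
forward over an in-memory byte string.  Positions are opaque objects, relative
moves must carry a unit, and a bare number is rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Distance:
    """A count with an explicit unit ("bytes" or "lines" in this sketch)."""
    count: int
    unit: str


def bytes_(n: int) -> Distance:
    return Distance(n, "bytes")


def lines(n: int) -> Distance:
    return Distance(n, "lines")


class Pos:
    """Opaque position; callers are not meant to inspect _byte."""

    def __init__(self, data: bytes, byte: int):
        self._data = data
        self._byte = byte

    def __add__(self, d):
        if not isinstance(d, Distance):
            return NotImplemented  # Pos + 10 ends up as a TypeError
        if d.count < 0:
            raise ValueError("this sketch only moves forward")
        new = self._byte
        if d.unit == "bytes":
            new += d.count
            if new > len(self._data):
                raise ValueError("moved past the end")
        elif d.unit == "lines":
            for _ in range(d.count):
                nl = self._data.find(b"\n", new)
                if nl == -1:
                    raise ValueError("ran past the last line")
                new = nl + 1
        else:
            raise ValueError(f"unknown unit {d.unit!r}")
        return Pos(self._data, new)


class Handle:
    def __init__(self, data: bytes):
        self._data = data
        self._pos = Pos(data, 0)

    @property
    def beg(self) -> Pos:
        return Pos(self._data, 0)

    @property
    def pos(self) -> Pos:
        return self._pos

    @pos.setter
    def pos(self, value):
        if isinstance(value, Distance):
            value = self.beg + value  # a unit-tagged count means "from the beginning"
        if not isinstance(value, Pos):
            raise TypeError("positions are opaque; a bare number has no units")
        self._pos = value

    def read_rest(self) -> bytes:
        return self._data[self._pos._byte:]


fh = Handle(b"one\ntwo\nthree\n")
fh.pos = fh.pos + lines(2)
print(fh.read_rest())  # b'three\n'
fh.pos = bytes_(4)
print(fh.read_rest())  # b'two\nthree\n'
```

Assigning a plain integer, as in fh.pos = 10, raises TypeError in this sketch,
and so does fh.pos + 10.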

Note also that we can treat string positions exactly the same way.
All the rule-ishly returned positions are defined as opaque objects already.

Larry
