On Mon, Sep 25, 2000 at 08:15:08PM -0000, Perl6 RFC Librarian wrote:
> How's this different from, for instance, generalizing source filters?
> Well, that's how I first tried to implement them in Perl, but line
> disciplines actually give you far, far more control over the file
> handling; your processing modules may dictate how line endings are
> parsed, whereas source filters have to go either before or after the
> data is split up into lines. Line discipline processing modules may
> alter the buffering behaviour of the stream, which you can't do in
> standard IO. (That's a hint that we're going to have to provide our own
> IO library to get these things working.)
A corollary is that implementing source filter functionality as standard
perl IO disciplines provides an easy way to test source filters.
Currently it's hard work testing a source filter, because the only way to
invoke the filter for testing is to run a perl script that uses it!
If your filter is actually a discipline, you just stack it on stdin and
test it with "print while (<>);" :-) [or
perl -p -e 'BEGIN {binmode (STDIN, ":+filter")}' ]
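As a rough illustration of that testing point, here is a minimal sketch in
Python rather than Perl (the ":+" syntax is only proposed, so there is nothing
to run it against). Rot13Discipline and filtered_lines are hypothetical names
invented for this sketch. The filter stacks on any byte stream, so a test can
hand it an in-memory stream instead of running a script through it.

```python
import io
import string

_LOWER = string.ascii_lowercase.encode()
_UPPER = string.ascii_uppercase.encode()
# Translation table that rotates ASCII letters by 13 places.
ROT13 = bytes.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


class Rot13Discipline(io.RawIOBase):
    """Hypothetical read-side filter: ROT13 over whatever it is stacked on."""

    def __init__(self, lower):
        super().__init__()
        self.lower = lower  # the stream below us in the stack

    def readable(self):
        return True

    def readinto(self, buf):
        chunk = self.lower.read(len(buf))
        if not chunk:
            return 0  # EOF from the stream below
        out = chunk.translate(ROT13)
        buf[:len(out)] = out
        return len(out)


def filtered_lines(stream):
    """Stack the filter on a stream and read it back line by line."""
    return list(io.BufferedReader(Rot13Discipline(stream)))
```

Calling filtered_lines(sys.stdin.buffer) would be the analogue of stacking the
filter on stdin; filtered_lines(io.BytesIO(...)) is the unit-test form.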
> open ($FH, "<", "japanese.euc.gz");
> binmode($FH, ":+decompress");
> binmode($FH, ":+euc_to_utf8");
> $foo = <$FH>; # This now UTF8.
Pedant point possibly (more implementation detail), but it might be worth
changing the example to :+gunzip as the gzip file format consists of a
(variable) header followed by a zip deflate data stream. You can find the
zip deflate stream within files with other formats.
How would it work if I were to implement :+gunzip as a filter that validates
the gzip header (if need be) and then pushes :+inflate after itself to do
the actual data extraction? (Possibly leaving :+gunzip in place to catch any
trailing CRC. Maybe not, as :+inflate could signal EOF before physical EOF
because of the data it gets.) If I understand the RFC correctly then it all
works due to the stacking nature. This is the sort of thing I'm assuming
that this is meant to make easy.
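To make the gunzip/inflate split concrete, here is a sketch in Python using
its zlib module. It is not a discipline stack: skip_gzip_header and gunzip are
hypothetical helper names, and a single function plays both roles. The header
layout and trailer fields follow the gzip format (RFC 1952); the raw inflater
stops at the end of the deflate stream, before physical EOF, and the leftover
bytes are the CRC32 and length trailer.

```python
import struct
import zlib

# gzip header flag bits (RFC 1952)
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def skip_gzip_header(data):
    """Validate a gzip header; return the offset where the deflate stream starts."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    if data[2] != 8:
        raise ValueError("unknown compression method")
    flags = data[3]
    pos = 10  # fixed part: magic, method, flags, mtime, xfl, os
    if flags & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flags & FNAME:
        pos = data.index(b"\0", pos) + 1  # zero-terminated file name
    if flags & FCOMMENT:
        pos = data.index(b"\0", pos) + 1  # zero-terminated comment
    if flags & FHCRC:
        pos += 2
    return pos


def gunzip(data):
    """Check the header, hand the body to a raw inflater, then check the trailer."""
    body_start = skip_gzip_header(data)
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate, no wrapper
    plain = inflater.decompress(data[body_start:]) + inflater.flush()
    # The inflater hits logical EOF before physical EOF; what remains is the trailer.
    trailer = inflater.unused_data
    if len(trailer) < 8:
        raise ValueError("truncated gzip trailer")
    crc, size = struct.unpack("<II", trailer[:8])
    if crc != zlib.crc32(plain) or size != len(plain) & 0xFFFFFFFF:
        raise ValueError("gzip trailer mismatch")
    return plain
```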
> =item Room 0: OS level
>
> This level implements buffering; it's here that the difference between,
> say, C<sysread> and C<read> becomes important. Modules in this layer
> must be added on the C<open> statement, since it controls very precisely
> how Perl looks at the data even before we read anything from it.
>
> The default behaviour is to emulate STDIO; in fact, the entirety of
> STDIO apart from splitting the input into lines (C<gets> and friends)
> gets implemented here.
>
> =item Room 1: Byte Transformations
> =item Room 2: Conversion to UTF8
> =item Room 3: Transformations on the UTF8
> =item Room 4: Records
Presumably room 4 implements (if not obsoletes) $\, $| and $/
(or provides the functionality behind handle methods such as autoflush()
in RFC 129).

For input, setting $/ to
  undef gives you :+slurper
  "" gives you the paragraph mode discipline in room 4
  "\n" (etc) gives you single or multiple character "line" endings
  \42 and other references to scalars get you records of that size.
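A rough Python sketch of those four input modes as one record-splitting
layer. The function name records and its sep convention are made up for
illustration; paragraph mode is simplified, and everything except the
fixed-size case reads the whole stream first, which a real discipline would
not do.

```python
import io
import re


def records(stream, sep=b"\n"):
    """Yield records from a binary stream, loosely following the $/ settings.

    sep=None  -> the whole stream as one record (slurp)
    sep=b""   -> paragraph mode: records end at runs of blank lines
    sep=bytes -> records end at that separator, which is kept
    sep=int   -> fixed-size records of that many bytes
    """
    if isinstance(sep, int):
        while True:
            chunk = stream.read(sep)
            if not chunk:
                return
            yield chunk
    data = stream.read()
    if sep is None:
        if data:
            yield data
    elif sep == b"":
        # Leading newlines are dropped; each run of 2+ newlines ends a record.
        text = data.lstrip(b"\n")
        for m in re.finditer(rb"(.+?)(\n{2,}|\Z)", text, re.S):
            yield m.group(1) + (b"\n\n" if m.group(2) else b"")
    else:
        pos = 0
        while pos < len(data):
            end = data.find(sep, pos)
            end = len(data) if end < 0 else end + len(sep)
            yield data[pos:end]
            pos = end
```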
That is all fine for Unix and anything else with a similar view of files.
However, for the last of these (fixed-size records), sv_gets in sv.c goes
#ifdef VMS
    /* VMS wants read instead of fread, because fread doesn't respect */
    /* RMS record boundaries. This is not necessarily a good thing to be */
    /* doing, but we've got no other real choice */
    bytesread = PerlLIO_read(PerlIO_fileno(fp), buffer, recsize);
#else
    bytesread = PerlIO_read(fp, buffer, recsize);
#endif
So if $/ were implemented as a room 4 discipline, how does it change room 0?
[So this is why the sig was "VMS must die" :-)]
Presumably if VMS users want record semantics they stack a different room 0
discipline, and we remove this specific overloading of $/.
I would assume that every output operation calls the room 4 discipline with
some bytes. If $| is 0 then it has the option of accumulating data to reduce
calls up the stack. If $| is 1 then it pushes to room 3 immediately.
Potentially there's a problem here - if I were writing a discipline I'd be
tempted to buffer data I got. Should output disciplines have some sort of
"push" flag? (Are there any others?) For example zlib's deflate() can be
called with Z_SYNC_FLUSH which outputs all internally buffered data
immediately, at the expense of lower compression than might be obtained by
hanging onto some data for a time. Giving output disciplines a push flag
means that my deflater discipline can be told by the application whether
time or space is the priority, but the downside is that the code is
(marginally) more complex.
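Here is a small Python sketch of such a push flag, using the zlib module's
Z_SYNC_FLUSH. DeflateDiscipline, its push argument and deflate_chunks are
hypothetical names for illustration, not anything from the RFC, and the
"next discipline" is just a callable.

```python
import zlib


class DeflateDiscipline:
    """Hypothetical output discipline wrapping zlib's deflate."""

    def __init__(self, lower_write):
        self.lower_write = lower_write  # callable standing in for the next discipline
        self.deflater = zlib.compressobj()

    def write(self, data, push=False):
        out = self.deflater.compress(data)
        if push:
            # Z_SYNC_FLUSH: emit everything buffered so far, at some cost in ratio.
            out += self.deflater.flush(zlib.Z_SYNC_FLUSH)
        if out:
            self.lower_write(out)

    def close(self):
        self.lower_write(self.deflater.flush())  # Z_FINISH by default


def deflate_chunks(chunks, push):
    """Run chunks through the discipline; return the writes seen downstream."""
    seen = []
    discipline = DeflateDiscipline(seen.append)
    for chunk in chunks:
        discipline.write(chunk, push=push)
    discipline.close()
    return seen
```

With push set on every write, each chunk reaches the layer below at once and
can be inflated immediately; without it the deflater hangs onto data and the
total output is smaller.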
sfio does disciplines in an undivided stack (none of these rooms). For input
you can view data flow as

    read() -> buffer -> application
    read() -> discipline0 -> discipline1 -> buffer -> application

For sfio there's no buffering between the lowest discipline and the OS calls.
This makes sense if you're assuming that most of the disciplines are
disciplined enough to read blocks of data from the OS, whereas the
application is written in C and enjoys parsing files using loops with
multiple getc()s which (as macros) the preprocessor inlines.
However, in perl it's more likely that programs will slurp a line of data
into a scalar before worrying about what that scalar contains using a
regexp. Hence it makes more sense to have the buffer at the OS end of the
stack. Also, it means that if a platform doesn't give any way of read()ing
files apart from fread(), we can at least take advantage of the buffering we
get at this level. (Was I right in thinking that a Win32 port of sfio
would be problematic because the lowest level read is fread(), and read() is
emulated in terms of that? I don't know Win32 and I like life like that. I'm
hoping that a sane platform's fread() for large, block-sized and
block-aligned reads should be nearly as fast as the underlying OS read.)
[In fact, the above diagram demonstrates a somewhat annoying bug, in that if
you stack a discipline after some data have been read, any buffered bytes
(read from the OS, but not yet by the application) end up downstream of your
discipline, and you never get a chance to munge them. Which is a pain if
you're trying to put the zlib inflater on because you've just validated a
gzip header at the application level.]
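The buffering problem can be reproduced with a toy model. This Python sketch
is not sfio's API: Stack is a hypothetical class that only mimics the
"read() -> disciplines -> buffer -> application" ordering, with an 8-byte
block size chosen to make the effect visible.

```python
import io


class Stack:
    """Toy sfio-style input stack: source -> disciplines -> buffer -> application."""

    def __init__(self, source, block=8):
        self.source = source      # stands in for the OS read()
        self.disciplines = []     # each is a bytes -> bytes callable
        self.buffer = b""
        self.block = block

    def push(self, discipline):
        self.disciplines.append(discipline)

    def _fill(self):
        # Read one block from the source and run it through every discipline.
        chunk = self.source.read(self.block)
        for discipline in self.disciplines:
            chunk = discipline(chunk)
        self.buffer += chunk
        return bool(chunk)

    def read(self, n):
        while len(self.buffer) < n and self._fill():
            pass
        out, self.buffer = self.buffer[:n], self.buffer[n:]
        return out


stack = Stack(io.BytesIO(b"HDRpayload and more"))
header = stack.read(3)    # application checks its 3-byte "header"
stack.push(bytes.upper)   # too late for the 5 bytes already buffered
rest = stack.read(16)     # b"paylo" arrives untouched, the rest upper-cased
```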
Nicholas Clark
PS Yes I *am* obsessed with deflating and inflating stuff. Preferably
without stuff noticing what I'm up to.