This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Modify open() to support FileObjects and Extensibility

=head1 VERSION

   Maintainer: Nathan Wiger <[EMAIL PROTECTED]>
   Date: 04 Aug 2000
   Last-Modified: 11 Aug 2000
   Version: 3
   Mailing List: [EMAIL PROTECTED]
   Number: 14

=head1 ABSTRACT

Currently, C<open()>, C<opendir()>, C<sysopen()>, and other file open
functions are given handle arguments, whose values are twiddled if a
filehandle can be created:

   open(HANDLE, "<$filename");
   open PIPE, "| $program";
   open my $fh, "<$filename";
   opendir DIR, $dirname;
   sysopen(HANDLE, $filename, O_RDWR|O_CREAT, 0666);

There are several problems with this approach:

   1. The calling style is uncharacteristic of other Perl funcs
   2. There is no way to support a list of return values
   3. There is no way to overload or extend them

In order to make these functions more consistent with other
constructor-like functions (e.g. new()), they should be changed to
instead return B<first-class fileobjects>:

   $fo = open "<$filename" or die;
   $po = open "|$program" or die;
   $do = open dir $dirname or die;
   $so = open sys $filename, O_RDWR|O_CREAT, 0666;

This would make these functions more internally consistent within Perl,
as well as allowing for the power of true B<fileobjects> and an
extensible C<open()>.

=head1 DESCRIPTION

=head2 Overview

First, this RFC assumes that B<fileobjects> will be $ single-whatzitz
(thanks Tom) types, which seems to have reached an overall informal
consensus.

As many have observed, the current filehandle mechanism is insufficient
and largely "a historical accident". The goal of the redesign of file
handles into full-fledged B<fileobjects> is to make them as flexible and
powerful as other objects within Perl, while still retaining a means to
interact with them simply. Since we are redesigning filehandles to be
true B<fileobjects>, we should revise their constructor functions as
well, returning B<fileobjects> and providing extensibility.

As shown above, in the simplest case this would change the C<open()>
function to:

   $fo = open $filename or die;

If successful, C<open()> and its relatives will return B<fileobjects>.
On failure, they will return undef. This still allows the user to test
the return value of C<open()> (as shown above) by checking for a "true"
(C<fileobject>) or "false" (undef) condition.
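
The return convention can be approximated in Perl 5 today. What follows
is only a minimal sketch, not the proposed implementation: C<open_fo> is
a hypothetical helper name, and it leans on the existing C<IO::File>
constructor, which already returns undef when the open fails.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempfile);

# Hypothetical helper: return a file object on success, undef on failure.
# Accepts a two-argument-style descriptor such as "</etc/passwd".
sub open_fo {
    my ($descriptor) = @_;
    my ($mode, $path) = $descriptor =~ /^(>>|\+?[<>])?\s*(.+)$/
        or return;
    return IO::File->new($path, $mode || '<');   # undef if the open fails
}

# Create a scratch file so the example does not depend on system files.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "hello\n";
close $tmp;

my $fo = open_fo("<$path") or die "Cannot open $path: $!";
my $first_line = <$fo>;
close $fo;

# A failed open yields undef, so the usual "or die" test still works.
my $missing = open_fo("</no/such/file/exists/here");
```

Because the helper returns either an object or undef, the caller can keep
using the familiar C<or die> idiom.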

=head2 New Syntax of C<open()>

The syntax of the C<open()> function would be changed as follows:

   $fileobject, [ @params ] = open [ $handler ] $file, [ @args ];

Let's examine this syntax more closely:

   $fileobject  -  Replacement object for current filehandles.

   @params      -  Optional parameters that may be returned in a list
                   context. These may be things such as the owner for
                   a true file, or the content-type for a web document.

   $handler     -  The class from which to load the appropriate file
                   methods, the default being "file". This is not 
                   really a class, but rather a registered handler. This
                   handler type is bound to a class given by the user,
                   or taken from a set of core methods. Think Apache
                   handler.               

   $file        -  File to open. This might be a real file or directory,
                   but might also be a website, port for a socket,
                   ftp server, ipc pipe, rpc client, and so on.

   @args        -  Optional arguments to pass to the handler's C<open>.

The C<open> function, as I propose it, is an overloaded and extensible
function that differs from other constructors in that it returns a valid
B<fileobject>. This object can then be used in C<read>, C<print>, and
other such file functions. As such, development of classes that can
handle objects on new platforms (e.g. Mac, Palm) and handle new types of
files (XML documents, etc) is much quicker. Plus, these modules are much
lighter weight since they don't have to reinvent the wheel every time.

=head2 Simple Scalar Form

In the simplest, "looks like Perl 5" form, C<open()> can take one file
parameter, which is then opened per the descriptor provided and the
corresponding B<fileobject> returned. Here are some examples (note that
C<my> has been left out for clarity):

   # Read from a file
   $fo = open "</etc/passwd" or die;
   print while (<$fo>);
   close $fo;

   # Write a file to a pipe
   $mailpipe = open "|/usr/lib/sendmail" or die;
   ($motd, $owner) = open "</etc/motd";       # return owner in list
   die unless $owner == 0;                    # die if owner is not root
   while (<$motd>) {
      print $mailpipe;
   }
   close $motd;

   # Go fork yourself
   ($myself, $pid) = open "-|" or exec 'ls';  # return PID in list
   print while (<$myself>);
   close $myself;   # not myself anymore, hah! ;-)
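
The list-context return used for C<$owner> above can be imitated in
Perl 5 with C<wantarray>. This is a sketch under assumptions: the helper
name C<open_with_owner> is hypothetical, and returning the owner's UID as
the only extra parameter is just one possible choice.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the handle in scalar context, and
# (handle, owner UID) in list context.
sub open_with_owner {
    my ($path) = @_;
    open(my $fh, '<', $path) or return;   # empty list / undef on failure
    return $fh unless wantarray;          # scalar context: handle only
    my $owner = (stat $fh)[4];            # element 4 of stat() is the UID
    return ($fh, $owner);
}

my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "message of the day\n";
close $tmp;

# List context: handle plus owner.
my ($motd, $owner) = open_with_owner($path) or die "open failed: $!";
my $line = <$motd>;
close $motd;

# Scalar context: handle only.
my $scalar_fh = open_with_owner($path) or die "open failed: $!";
close $scalar_fh;
```

Returning an empty list on failure keeps C<or die> working in both
contexts, since a list assignment from an empty list is false.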

In addition, the C<$file> argument becomes optional in this new syntax.
If not supplied, it defaults to C<$_>, making it consistent with other
Perl functions:

   for (@filenames) {
      my $fo = open;      
      push @open_handles, $fo;
   }

   # ... stuff happens ...

   for (@open_handles) {
      close;
   }

Perhaps this specific example is ugly (and useless), but there are
probably other situations in which one could take advantage of this.
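
Defaulting to C<$_> is easy to emulate in a Perl 5 subroutine. In this
sketch C<open_default> is a hypothetical name, and the files are scratch
files created only for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open the named file for reading, or the file
# named by $_ when called with no arguments.
sub open_default {
    my $path = @_ ? $_[0] : $_;
    open(my $fh, '<', $path) or return;
    return $fh;
}

# Three scratch files standing in for @filenames.
my @filenames = map { my (undef, $name) = tempfile(UNLINK => 1); $name } 1 .. 3;

my @open_handles;
for (@filenames) {
    my $fo = open_default() or die "Cannot open $_: $!";
    push @open_handles, $fo;
}

# ... stuff happens ...

close $_ for @open_handles;
```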

=head2 True First-Class FileObjects

One major limitation of Perl's current filehandles is that they are
bareword scalars, with no object properties or power. The redesign of
simple filehandles into first-class B<fileobjects> allows us to give
them full object-oriented power, while still allowing them to be used
in a simple manner as shown above. Each object can contain methods to
allow us to access features of that B<fileobject> much more efficiently.

Here are some proposed default accessor methods of B<fileobjects>. Each
of these would return the appropriate value, or undef if not available.
This is a brief listing; the intent would be to support all of the
current C<FileHandle> methods and then some. In a string context, the
filename as opened (including descriptors like | and <) is returned.

   $fo->STRING    -  Same as $fo->filename
   $fo->filename  -  Name of the file, web document, port, etc
   $fo->type      -  One of 'pipe', 'file', 'ipc', etc, ala want()
   $fo->mode      -  Way the file was opened (|,<,+>,etc)
   $fo->fileno    -  System file number
   $fo->dup       -  Returns a duplicate of the current B<fileobject>
   $fo->pid       -  Return current PID of the process (if pipe/fork)

In addition, these functions would allow you to modify key elements of
the B<fileobject>:

   $fo->autoflush -  Sets buffer flushing policy
   $fo->untaint   -  Removes tainting from that data source
   $fo->options   -  Some syscalls, like C<socket()>, allow you to set
                     options which affect the handling of C<$fo>
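
A few of these accessors can be mocked up in Perl 5 to show the intended
feel. Everything here is an assumption made for illustration: the class
name C<FileObject>, the hash layout, and the decision to stringify to the
descriptor as opened. Only plain files are handled.

```perl
package FileObject;    # hypothetical class name
use strict;
use warnings;
use IO::File;
use overload
    '""'     => sub { $_[0]->STRING },   # used in string context
    fallback => 1;

sub open {
    my ($class, $descriptor) = @_;
    my ($mode, $name) = $descriptor =~ /^(>>|\+?[<>])?\s*(.+)$/
        or return;
    $mode ||= '<';
    my $fh = IO::File->new($name, $mode) or return;
    return bless { fh => $fh, filename => $name, mode => $mode }, $class;
}

sub filename { $_[0]{filename} }
sub mode     { $_[0]{mode} }
sub type     { 'file' }                           # only plain files here
sub fileno   { $_[0]{fh}->fileno }
sub STRING   { $_[0]{mode} . $_[0]{filename} }    # descriptor as opened
sub close    { $_[0]{fh}->close }

package main;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my $fo = FileObject->open("<$path") or die "Cannot open $path: $!";
my $as_string = "$fo";      # stringification goes through STRING
my $fd        = $fo->fileno;
$fo->close;
```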

If we decide that B<fileobjects> should be persistent across C<close()>
operations, we could define the following functions:

   $fo->open      -  Object methods to open/close C<fileobjects>
   $fo->close
   $fo->is_open   -  1 or undef, depending on the state of the object
   $fo->is_closed

Why would a B<fileobject> be persistent across C<close()> operations? If
it contains lots of properties, it may be a waste to destroy it when we
simply want to close it to make sure buffers are flushed or bandwidth is
not wasted on TCP connections. We could use C<close()> to flush buffers
and tidy things up, but not destroy the object until the end of the
script or until C<undef $fo> is called (similar to C<FileHandle>).
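
An object that survives C<close()> might behave like the following
Perl 5 sketch. The class name C<PersistentFile> and its C<path> accessor
are hypothetical; the C<open>, C<close>, and C<opened> calls on the inner
handle are existing C<IO::File> and C<IO::Handle> methods.

```perl
package PersistentFile;   # hypothetical class name
use strict;
use warnings;
use IO::File;

sub new {
    my ($class, $path, $mode) = @_;
    my $self = bless {
        path => $path,
        mode => $mode || '<',
        fh   => IO::File->new,      # created unopened
    }, $class;
    return $self->open ? $self : undef;
}

sub open      { my $s = shift; $s->{fh}->open($s->{path}, $s->{mode}) }
sub close     { my $s = shift; $s->{fh}->close }
sub is_open   { $_[0]{fh}->opened ? 1 : undef }
sub is_closed { $_[0]->is_open ? undef : 1 }
sub path      { $_[0]{path} }

package main;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "line one\n";
close $tmp;

my $fo = PersistentFile->new($path) or die "Cannot open $path: $!";
my $open_before = $fo->is_open;

$fo->close;                          # object and its properties remain
my $closed_after = $fo->is_closed;

$fo->open or die "Cannot reopen $path: $!";
my $line = $fo->{fh}->getline;
$fo->close;
```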

=head2 Extensible Handler Bindings

In addition to the standard file form, C<open()> can also now take an
optional handler name (the default is "file") from which to load the
appropriate methods. This gives us easy access to methods that open
Directory, Socket, HTTP, FTP, or other types of files, meaning we no
longer have to start from the ground up every time we want to open a new
type of "file".

Here, C<open handlers> work much like Apache handlers. You specify a
certain type of handler (for example, "file", "dir", "http", "ftp", and
so on) as what should be used to operate on the given argument. The only
requirement is that the handler return one of two values:

   1. A valid C<fileobject> which contains certain mandatory
      object methods (exact methods yet to be hammered out).

   2. undef if the file can't be handled, which allows stacked
      handlers.

As such, when a user specifies something like:

   # Open a dir
   $dir = open dir "/usr/bin";

Then the "dir" handler is called by Perl's standard indirect object
notation:

   $dir = dir->open("/usr/bin");

However, "dir" would not correspond directly to a class, but rather an
event handler. To paraphrase Tom Hughes's great explanation of this, the
core should provide a way for modules to register themselves as being
handlers for certain files. For example, you might load
C<LWP::UserAgent>, which would register as being a valid "http" handler.
Or, you might load C<Net::FTP>, which would register itself as being an
"ftp" handler. For example:

   use LWP::UserAgent;
   register open, 'http' => LWP::UserAgent::new;   # first stab
   $web = open http "http://www.yahoo.com", 'GET'; 

The C<register> line gives an example of what might happen. This might
be specified by the user (as shown), or might be done automatically by
loading modules under a certain part of the lib tree. The best mechanism
for this should probably be addressed in a separate RFC, since it has
far more general applications than just for C<open>.

These handlers should be stackable. If an C<open> handler returns undef,
then the next one in line tries to open it. If no handler can open a
specific file, then undef is returned from C<open> to the user,
consistent with current behavior.

For example, we might stack "dir" handlers so that only certain users
can look at the contents of "/usr/sbin":

   use Unix::OnlyRootSeesUsrSbin;   
   register open, 'dir' => Unix::OnlyRootSeesUsrSbin::open;
   $dir = open dir "/usr/sbin";    # not root? no /usr/sbin for you!
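
The registration and stacking rules described above can be prototyped in
Perl 5 with a hash of code references. This is a sketch only:
C<register_open> and C<open_via> are hypothetical stand-ins for the
proposed C<register open, ...> and C<open HANDLER $file> forms, and the
auditing handler exists purely to show a handler declining.

```perl
use strict;
use warnings;

my %open_handlers;    # handler name => list of code refs, newest first

# Hypothetical stand-in for "register open, NAME => CODE".
sub register_open {
    my ($name, $code) = @_;
    unshift @{ $open_handlers{$name} }, $code;
}

# Hypothetical stand-in for "open HANDLER $file, @args".
sub open_via {
    my ($name, $file, @args) = @_;
    for my $handler (@{ $open_handlers{$name} || [] }) {
        my $fo = $handler->($file, @args);
        return $fo if defined $fo;     # first handler to succeed wins
    }
    return;                            # no handler could open it
}

# Base "dir" handler: a plain opendir().
register_open(dir => sub {
    my ($path) = @_;
    opendir(my $dh, $path) or return;
    return $dh;
});

# Stacked on top: records each request, then declines by returning
# undef so that the next handler in line gets a turn.
my @audit_log;
register_open(dir => sub {
    my ($path) = @_;
    push @audit_log, $path;
    return;
});

my $real   = open_via(dir => '.');
my $failed = open_via(dir => '/no/such/directory/here');
closedir $real if $real;
```

Note that under these rules a handler that returns undef never blocks
access by itself; it only passes the request on to the next handler.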
  
Here are some more examples which could be used to provide essentially
native file access for many different media:

   # Open a file
   # Note "file" is the default class so doesn't have to be specified
   $motd = open ">/etc/motd" or die;
   print $motd @data;
   close $motd;

   # Open a directory 
   $dir = open dir "/usr/bin";
   @files = grep !/^\..*/, <$dir>;  # no more readdir
   close $dir;                      # no more closedir

   # Open a client socket with IO::Socket
   # By overloading < and > we can do clients and servers!
   $socket = open socket "< 25", PF_INET, SOCK_STREAM, TCP;
   @input = <$socket>;
   close $socket;
   do_something(@input);

   # Open a remote webpage
   $http = open http "http://www.perl.com/", GET;
   @doc = <$http>;
   print @doc if $http->content_type eq 'text/html';
   close $http;

   # Open an ftp connection
   $ftp = open ftp "ftp.perl.com";
   $ftp->cwd('CPAN') or die;
   @files = <$ftp>;          # overloading as dir
   close $ftp; 

   # Open a file like sysopen()
   # print $sysfile automatically calls $sysfile->print
   $sysfile = open sys "/etc/issue", O_RDWR|O_CREAT, 0644;
   print $sysfile "Hello world!", $buf, $len, $offset;
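
Reading a directory through C<< <$dir> >> can be approximated in Perl 5
by overloading the C<< <> >> iterator. This is a sketch under
assumptions: C<DirObject> is a hypothetical class name, and the loop
reads one entry at a time because Perl 5 calls an overloaded iterator in
scalar context only.

```perl
package DirObject;     # hypothetical class name
use strict;
use warnings;
use overload
    '<>'     => sub { readdir($_[0]{dh}) },   # next entry, undef at end
    fallback => 1;

sub open {
    my ($class, $path) = @_;
    opendir(my $dh, $path) or return;
    return bless { dh => $dh, path => $path }, $class;
}

sub close { closedir($_[0]{dh}) }

package main;
use File::Temp qw(tempdir);

# Build a scratch directory containing two empty files.
my $tmpdir = tempdir(CLEANUP => 1);
for my $name (qw(alpha beta)) {
    open(my $fh, '>', "$tmpdir/$name") or die "Cannot create $name: $!";
    close $fh;
}

my $dir = DirObject->open($tmpdir) or die "Cannot open $tmpdir: $!";
my @files;
while (defined(my $entry = <$dir>)) {
    push @files, $entry unless $entry =~ /^\./;   # skip dot entries
}
$dir->close;
@files = sort @files;
```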

The advantages to using such extensible C<open handlers> are twofold:

   1. No more reinventing the wheel to support new file
      types on different platforms or networks.

   2. Easy integration so that <> can address files, dirs,
      pipes, websites, ipc communications, sockets, rpc...

Assuming that these handlers all agree to return C<fileobjects> with a
consistent set of properties, this could lead to great optimizations
since the structure of these C<fileobjects> will be known ahead of time. 

=head2 Custom File Methods

In addition, handlers can provide a custom set of C<read>, C<print>, and
other functions that do different things from the core Perl set. So, a
Unix/NT integration module might include a C<print> method that delimits
strings differently based on filesystem. Or, a module might provide a
custom C<close> method that does special cleanup before closing a file.

This is all accomplished automatically by Perl's indirect object
notation. All of a given C<open> module's functions would inherit from
the core file methods. For example:

   package MyHttpOpen;
   
   sub open {
       # do stuff...
       return $new_http_fileobject;
   }

   sub print {
       # overrides CORE::print()
       # maybe it PUTs or POSTs data
   }

Then, in your script, you would say:

   use MyHttpOpen;
   register open, 'http' => MyHttpOpen;
   $post = open http "http://upload.mydomain.com/upload.cgi", POST;
   print $post @filedata;     # calls $post->print
   $reply = <$post>;          # calls CORE::read

In this example, a custom C<print> is provided. However, a corresponding
C<read> is not provided, so C<CORE::read> is used (which should actually
be C<SUPER::read> assuming inheritance is set up correctly).
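
The inheritance relationship can be demonstrated in Perl 5 without any
network access. In this sketch C<CoreFile> is a hypothetical stand-in for
the core file methods, and its C<print> and C<read> merely record and
replay strings in memory; explicit method calls are used because Perl 5
does not dispatch the C<print> builtin to object methods.

```perl
package CoreFile;      # hypothetical stand-in for the core file methods
use strict;
use warnings;

sub open {
    my ($class, $target) = @_;
    return bless {
        target => $target,
        sent   => [],               # everything "printed" so far
        inbox  => ["200 OK\n"],     # canned data for read() to return
    }, $class;
}

sub print { my $self = shift; push @{ $self->{sent} }, @_; return 1 }
sub read  { my $self = shift; return shift @{ $self->{inbox} } }

package MyHttpOpen;
our @ISA = ('CoreFile');

# Custom print: prefix a request line, then defer to the parent class.
sub print {
    my $self = shift;
    return $self->SUPER::print("POST $self->{target}\n", @_);
}

package main;

my $post = MyHttpOpen->open("http://upload.mydomain.com/upload.cgi");
$post->print("file data");    # MyHttpOpen::print
my $reply = $post->read;      # not overridden, so CoreFile::read runs
```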

=head1 IMPLEMENTATION

=head2 Syntax Changes

The C<open> function would have to be altered to accept this new syntax
and return a full-fledged B<fileobject>. 

The C<close> function would remain basically unchanged, acting on the
B<fileobject> (or C<$_> if none is specified).

A method must be developed for registering classes as event handlers.
This has many more general applications than just for C<open>, and so
should be addressed in a separate RFC. For example, this process might
be able to be unified across C<open>, signals, alarms, and so on.

=head2 Function Deletions

Because this syntax is flexible enough to handle B<any> type of file
operation, the following functions should be removed from Perl 6:

   # Replaced by 'open dir'
   opendir
   readdir         # dir->read instead
   closedir        # close instead
   seekdir         # dir->seek instead
   rewinddir
   telldir

The first four functions have easy alternatives per this RFC. The last
two do not currently. If they are not widely used (I suspect they're
not, especially since they're not implemented everywhere), they should
be moved to a compatibility module (or axed).

   # Replaced by 'open sys'
   sysopen
   sysread         # sys->read instead
   syswrite        # sys->write instead

The sys* family of file functions can, in my opinion, be safely removed
because the following alternative makes the low-level handling of files
much easier:

   $sysin = open sys "/etc/motd", O_RDONLY; 
   $sysout = open sys "/tmp/datafile", O_RDWR|O_CREAT, 0644;

   # read calls $sysin->read custom method
   $data = read $sysin, $buf, $blksize;

   # print calls $sysout->print, alias to $sysout->write
   print $sysout "Hello, world!", $buf, $len, $offset;

My own experience is that if a file is C<sysopen>'ed, usually C<sysread>
and C<syswrite> are used on it (and vice versa). As such, the above
seems more natural and consistent.
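
For comparison, here is how a C<sys> handler's object might wrap the
existing functions in Perl 5. This is a sketch under assumptions:
C<SysFile> is a hypothetical class name, its C<read> and C<write> methods
simply forward to C<sysread> and C<syswrite>, and a scratch file is used
in place of the paths shown above.

```perl
package SysFile;     # hypothetical class name
use strict;
use warnings;

sub open {
    my ($class, $path, $flags, $perms) = @_;
    $perms = 0666 unless defined $perms;
    sysopen(my $fh, $path, $flags, $perms) or return;
    return bless { fh => $fh }, $class;
}

# Thin wrappers around the unbuffered system calls. $_[0] is used
# directly in read() so the caller's buffer variable is filled in place.
sub read  { my $self = shift; return sysread($self->{fh}, $_[0], $_[1]) }
sub write { my $self = shift; return syswrite($self->{fh}, $_[0]) }
sub close { my $self = shift; return CORE::close($self->{fh}) }

package main;
use Fcntl qw(O_RDONLY O_RDWR O_CREAT);
use File::Temp qw(tempdir);

my $path = tempdir(CLEANUP => 1) . "/datafile";

my $sysout = SysFile->open($path, O_RDWR | O_CREAT, 0644)
    or die "Cannot create $path: $!";
$sysout->write("Hello, world!");
$sysout->close;

my $sysin = SysFile->open($path, O_RDONLY) or die "Cannot open $path: $!";
my $count = $sysin->read(my $buf, 5);    # read the first five bytes
$sysin->close;
```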

   # Replaced by 'open socket'
   socket
   setsockopt      # socket->options
   connect         # overload <> or socket->connect
   shutdown        # close instead   

While "open socket" is admittedly 5 characters more than just "socket",
these are made up for by the fact that "close" is 3 characters shorter
than "shutdown", and "options" is another 3 shorter than "setsockopt".
So, you're actually 1 character ahead! :-)

=head2 Performance

In order to prevent performance hits, anything that is packaged as a
default file type (such as files, pipes, directories, sockets, ipc,
and so on) must be highly optimized for interaction with this new
version of C<open()>. Basically, anything living under the C<IO::>
tree should be ripped apart and put back together again, or replaced
with a new version under C<Open::> or something similar.

=head1 CHANGES

   1. Modified C<open> syntax to use indirect object notation

   2. Added the concept of an C<open handler> and proposals for
      how handlers could be stacked and how classes could register
      handlers

   3. Expanded the list of functions to remove

=head1 REFERENCES

RFC 33: Eliminate bareword filehandles

RFC 30: STDIN, STDOUT, and STDERR should be renamed 

RFC 49: Objects should have builtin stringifying STRING method 

Tom Christiansen's great analysis of file object methods

Tim Jenness's suggestion to use optimized IO objects for all I/O

Nick Ing-Simmons's suggestion to hide IO:: classes from the user

Tom Hughes's formalization of a way to register open() handlers

Chaim's suggestion of a way to register classes in Perl 6

Everyone else on perl6-language-io, you guys are great! Thanks. :-)

