The HAppS HTTP code basically delivers the first 64k and a handle to acquire the rest. In 99% or more of cases the document fits in memory, so the 64k bound is fine. If you have something bigger, the user is going to have to decide how to handle that on a case-by-case basis.
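
To make the shape concrete, it is something like this (hypothetical
names, not the actual HAppS API):

    import qualified Data.ByteString as B
    import System.IO (Handle)

    -- The common case lives in the first strict chunk; the handle is
    -- only touched if the caller decides to stream the remainder.
    data Body = Body
        { firstChunk :: B.ByteString   -- up to 64k, already in memory
        , bodyRest   :: Handle         -- read the remainder on demand
        }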

Note: chunked encoding means that there is no theoretical limit to how big an HTTP request or response may be.

-Alex-

Jules Bean wrote:
I've been having something of a discussion on #haskell about this but
I had to go off-line and, in any case, it's a complicated issue, and I
may be able to be more clear in an email.

The key point under discussion was what kind of interface the HTTP
library should expose: synchronous, asynchronous? Lazy, strict?

As someone just pointed out, "You don't like lazy IO, do you?". Well,
that's a fair characterisation. I think unsafe lazy IO is a very very
cute hack, and I'm in awe of some of the performance results which
have been achieved, but I think the disadvantages are underestimated.

Of course, there is a potential ambiguity in the phrase 'lazy IO'. You
might interpret 'lazy IO' quite reasonably to refer to any programming
style in which the IO is performed 'as needed' by the rest of the
program. So, to be clear, I'm not raising a warning flag about that
practice in general, which is a very important programming
technique. I'm raising a bit of a warning flag over the particular
practice of achieving this in a way which conceals IO inside thunks
which have no IO in their types: i.e. using unsafeInterleaveIO or even
unsafePerformIO.

Why is this a bad idea? Normally evaluating a haskell expression can
have no side-effects. This is important because, in a lazy language,
you never quite know[*] when something's going to be evaluated. Or if
it will. Side-effects, on the other hand, are important things (like
launching nuclear missiles) and it's rather nice to be precise about
when they happen. One particular kind of side effect which is slightly
less cataclysmic (only slightly) is the throwing of an exception. If
pure code, which is normally guaranteed to "at worst" fail to
terminate, can suddenly throw an exception from somewhere deep in its
midst, then it's extremely hard for your program to work out how far
it has got, and what it has done, and what it hasn't done, and what it
should do to recover. On the other hand, no failure may occur, but the
data may never be processed, meaning that the IO is never 'finished'
and valuable system resources are locked up forever. (Consider a naive
program which reads only the first 1000 bytes of an XML document
before getting an unrecoverable parse failure. The socket will never
be closed, and system resources will be consumed permanently.)
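
Here is a minimal sketch of that trap, using a local file in place of
the socket (assume a large "big.xml" exists):

    import System.IO
    import qualified Data.ByteString.Lazy as L

    main :: IO ()
    main = do
        h <- openFile "big.xml" ReadMode
        contents <- L.hGetContents h       -- lazy: the reads hide in thunks
        let prefix = L.take 1000 contents  -- a 'parser' demanding 1000 bytes
        print (L.length prefix)
        -- Nothing ever demands the rest, so EOF is never reached and h
        -- is never closed; only the GC will (maybe) finalise it, and a
        -- read error would surface as an exception from 'pure' code.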

Trivial programs may be perfectly content to simply bail out if an
exception is thrown. That's very sensible behaviour for a small
'pluggable' application (most of the standard unix command line
utilities bail out silently or nearly silently on SIGPIPE, for
example). However, this is clearly not acceptable behaviour in a
complex program. There may be resources which need to be released,
there may be data which needs saving, there may be reconstruction to
be attempted on whatever it was that 'broke'.

Error handling and recovery is hard. Always has been. One of the
things that greatly simplifies such issues is knowing "where"
exceptions can occur. In haskell they can only occur in the IO monad,
and they can only occur in rather specific ways: in most cases, thrown
by particular IO primitives; they can also be thrown *to* you by other
threads, but as the programmer, that's your problem!

Ok. Five paragraphs of advocacy is plenty. If anyone is still reading
now, then they must be either really interested in this problem, or
really bored. Either way, it's good to have you with me! These issues
are explained rather more elegantly by Oleg in [1].


So, where does that leave the HTTP library? Well here are the kinds of
interface I can imagine. I'm deliberately ignoring all the stuff about
request headers, request content, and imagining that this is all about
URL -> ByteString. Here are the options that occur to me:

A. Strict, synchronous GET
   sSynGET :: URL -> IO ByteString

   Quite simply blocks the thread until all the data has
   arrived. Throws some kind of exception on failure, presumably. This
   is a simple primitive, appropriate for relatively small files
   (files which fit comfortably in your memory) and simple
   programs. It's also great for programs which want to control the
   degree of asynchrony themselves; they can just fork as many
   threads as they choose to GET with.
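
   A sketch of how this might look, treating the URL as a bare host
   name and ignoring status line and header parsing entirely (connectTo
   is the old Network module's primitive; everything else is simplified
   to the point of caricature):

       import Network (connectTo, PortID(..))
       import System.IO
       import qualified Data.ByteString as B

       type URL = String

       -- Hypothetical helper: connect and send the request for a URL.
       openURL :: URL -> IO Handle
       openURL url = do
           h <- connectTo url (PortNumber 80)
           hPutStr h "GET / HTTP/1.0\r\n\r\n"
           hFlush h
           return h

       sSynGET :: URL -> IO B.ByteString
       sSynGET url = do
           h <- openURL url
           body <- B.hGetContents h   -- strict: blocks until EOF
           hClose h
           return body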

B. Strict, asynchronous GET
   sAsynGET :: URL -> IO (MVar ByteString)

   Download the entire data, but do it in a separate thread. Give me
   an MVar so I can synchronise on the data's arrival in whichever way
   suits my program best. Suitable for small files which fit
   conveniently in memory. It's very easy to implement in terms of
   forkIO and sSynGET, so really it's a convenience function.
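
   For instance, building on the sSynGET sketch above (a real version
   should catch exceptions in the forked thread and put an Either into
   the MVar rather than leave it forever empty):

       import Control.Concurrent (forkIO)
       import Control.Concurrent.MVar

       sAsynGET :: URL -> IO (MVar B.ByteString)
       sAsynGET url = do
           v <- newEmptyMVar
           forkIO (sSynGET url >>= putMVar v)
           return v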

C. Strict, synchronous, GET-to-file
   sSynFileGET :: URL -> FilePath -> IO ()
D. Strict, asynchronous, GET-to-file
   sAsynFileGET :: URL -> FilePath -> IO (MVar ())

   Download the entire data to a local file. This means that it
   doesn't matter if the data is far bigger than local memory, it can
   still be done efficiently. [Note that this doesn't mean it must use
   lazy getContents magic under the hood. It could easily use strict
   hGet in reasonably sized chunks and write them out straightaway.]
   The only difference between the two variants is that one keeps
   control until completion, the other gives you an MVar which you can
   block on if/when you choose.  This method is appropriate for
   clients which need extremely large data, and don't mind waiting for
   it to finish before they start processing. It is also appropriate
   for clients which want random access to large data requests (using
   underlying hSeek-based random file IO, once the file has downloaded).
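
   A sketch of the strict chunked copy, reusing openURL from the
   option A sketch (32k is an arbitrary chunk size):

       sSynFileGET :: URL -> FilePath -> IO ()
       sSynFileGET url path = do
           h   <- openURL url
           out <- openBinaryFile path WriteMode
           let loop = do
                   chunk <- B.hGet h 32768    -- strict, bounded memory
                   if B.null chunk            -- empty means EOF
                       then return ()
                       else B.hPut out chunk >> loop
           loop
           hClose out
           hClose h

       sAsynFileGET :: URL -> FilePath -> IO (MVar ())
       sAsynFileGET url path = do
           v <- newEmptyMVar
           forkIO (sSynFileGET url path >> putMVar v ())
           return v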

E,F. Progressive GET
   pSynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
   pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())

   (This is a particularly simple case of Oleg's iteratees, I
   think.) Download the data at whatever speed is convenient. As data
   arrives, feed it to the 'callback' provided. The ByteString is the
   new chunk of data, the 'Bool' is just supposed to indicate whether
   or not this is the final chunk. You can imagine slight
   variations. Note that the library promises not to retain any
   references to the ByteString chunks, so if the callback processes
   them and then discards them they are eligible for garbage
   collection. If the callback wishes to accumulate them, it can quite
   easily 'append' the strict chunks into a lazy bytestring, which is
   an efficient operation. This is suitable for applications which
   wish to do something like progressive display of a progressive JPEG,
   or compute a summary function, or perhaps even display an animation,
   although that last would normally need a little more structure to
   try to guarantee the data rate.
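
   A sketch of the synchronous variant, again reusing openURL from the
   option A sketch (the convention for the final Bool here is just one
   of the possible choices):

       pSynGET :: URL -> ((Bool, B.ByteString) -> IO ()) -> IO ()
       pSynGET url callback = do
           h <- openURL url
           let loop = do
                   chunk <- B.hGet h 32768
                   eof   <- hIsEOF h
                   callback (eof, chunk)    -- True marks the final chunk
                   if eof then hClose h else loop
           loop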

   Incidentally there are more complex options than (Bool,ByteString)
   -> IO ().  A simple and obvious change is to add a return
   value. Another is a 'state monad by hand', as in (Bool,ByteString)
   -> s -> s, and change the final return value of the type to IO s,
   which allows the callback to accumulate summary information and
   still be written as pure code. Other options allow the 'callback'
   to request early termination, by layering in an 'Either' type in
   there. Another more sophisticated option, I think, is the higher
   rank

   MonadTrans t =>
       URL -> (forall m. Monad m => (Bool,ByteString) -> t m ())
           -> t IO ()

   ...which, unless I've made a mistake, allows you to write in 'any
   monad which can be expressed as a transformer', by transforming it
   over IO, but still contains the implicit promise that the
   'callback' does no IO. For example t = StateT reduces to the
   earlier s -> s example, in effect, with a slightly different data
   layout.
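
   To make that concrete, here is one way the higher-rank version might
   look, together with the StateT instantiation (pGetT is a
   hypothetical name; openURL and the ByteString import are from the
   option A sketch):

       {-# LANGUAGE RankNTypes #-}
       import Control.Monad.State

       pGetT :: (MonadTrans t, Monad (t IO))
             => URL
             -> (forall m. Monad m => (Bool, B.ByteString) -> t m ())
             -> t IO ()
       pGetT url callback = do
           h <- lift (openURL url)
           let loop = do
                   chunk <- lift (B.hGet h 32768)
                   eof   <- lift (hIsEOF h)
                   callback (eof, chunk)
                   if eof then lift (hClose h) else loop
           loop

       -- Instantiating t = StateT Int recovers the s -> s style: a
       -- pure byte counter that still streams in constant space.
       countBytes :: URL -> IO Int
       countBytes url = execStateT (pGetT url step) 0
         where
           step :: Monad m => (Bool, B.ByteString) -> StateT Int m ()
           step (_, chunk) = modify (+ B.length chunk)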

   Another couple of refinements to the above are that in practice you
   normally want to 'guarantee' your callback only runs on a chunk
   size of at least X (in some cases 'exactly X'), or you want to
   guarantee it's called at least every Y seconds. Neither of these is
   hard to add; a sketch of the first follows.
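
   For example, the 'at least X bytes per call' variant can be layered
   over any of the progressive interfaces as a buffering wrapper (a
   sketch; an IORef holds the carried-over bytes):

       import Data.IORef

       atLeast :: Int
               -> ((Bool, B.ByteString) -> IO ())
               -> IO ((Bool, B.ByteString) -> IO ())
       atLeast n callback = do
           buf <- newIORef B.empty
           return $ \(final, chunk) -> do
               acc <- readIORef buf
               let acc' = acc `B.append` chunk
               if final || B.length acc' >= n
                   then writeIORef buf B.empty >> callback (final, acc')
                   else writeIORef buf acc'

   so that 'atLeast 4096 cb >>= pSynGET url' only ever hands cb chunks
   of 4k or more (bar the final one).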

Given these three pairs of options, what need is there for an unsafe
lazy GET?  What niche does it fill that is not equally well filled by
one of these?

Program conciseness, perhaps. The kind of haskell oneliner whose
performance makes us so (justly) proud. In isolation, though, I don't
find that a convincing argument; not with the disadvantages also taken
into account.

The strongest argument then is that you have a 'stream processing'
function, that is written 'naively' on [Word8] or Lazy ByteString, and
wants to run as data is available, yet without wasting space. I'm
inclined to feel that, if you really want to be able to run over 650M
files, and you want to run as data is available, then in practice you
want to be able to give feedback to the rest of your application on
your progress so far; i.e., L.ByteString -> a is actually too simple a
type anyway.

I'm interested to know what opinions other people have on this,
whether I've made any serious logic mistakes or just overlooked
another approach which has advantages. Having spent quite a while this
evening thinking this over, I just don't see the convincing case for
the unsafe lazy approach, and I see plenty of problems with it...

Cheers,

Jules


* Well, OK. If you're smart, then you know. Because you're smart, and
  you thought about it carefully. But most of the time, you don't need
  to know. And if you produce data (a pure structure) which is then
  consumed by a library - or vice versa - then you don't know unless
  you read their code.

[1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html


_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
