On 25/04/2013, at 3:47 AM, Duncan Coutts wrote:

> It looks like fold and unfold fusion systems have dual limitations:
> fold-based fusion cannot handle zip-style functions, while unfold-based
> fusion cannot handle unzip-style functions. That is, fold-based cannot
> consume multiple inputs, while unfold-based cannot produce multiple
> outputs.

Yes. This is a general property of data-flow programs, not something specific 
to compilation via Data.Vector-style co-inductive stream fusion, nor a property 
of fold/unfold/hylo fusion. 


Consider these general definitions of streams and costreams.

-- A stream must always produce an element.
type Stream a   = IO a

-- A costream must always consume an element.
type CoStream a = a -> IO ()


And operators on them (writing S for Stream and C for CoStream).

-- Versions of map.
map     :: (a -> b) -> S a -> S b    (ok)
comap   :: (a -> b) -> C b -> C a    (ok)

-- Versions of unzip.
unzip   :: S (a, b) -> (S a, S b)    (bad)
counzip :: C a -> C b -> C (a, b)    (ok)
unzipc  :: S (a, b) -> C b -> S a    (ok)

-- Versions of zip.
zip     :: S a -> S b -> S (a, b)    (ok)
cozip   :: C (a, b) -> (C a, C b)    (bad)
zipc    :: C (a, b) -> S a -> C b    (ok)



The operators marked (ok) can be implemented without buffering data, while 
those marked (bad) may need an arbitrarily sized buffer.
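
For concreteness, here is a minimal sketch of the non-hybrid (ok) operators in 
terms of the IO-based definitions above. None of them holds any state:

import Prelude hiding (map, zip, unzip)

type S a = IO a          -- pull an element
type C a = a -> IO ()    -- push an element

-- Apply f to each element as it is pulled: no state.
map :: (a -> b) -> S a -> S b
map f sa = fmap f sa

-- Apply f to each element before pushing it downstream: no state.
comap :: (a -> b) -> C b -> C a
comap f cb = cb . f

-- One pull from each source yields exactly one tuple: no buffer.
zip :: S a -> S b -> S (a, b)
zip sa sb = do
  a <- sa
  b <- sb
  return (a, b)

-- Each incoming tuple is split and both halves pushed at once: no buffer.
counzip :: C a -> C b -> C (a, b)
counzip ca cb (a, b) = ca a >> cb b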

Starting with 'unzip', suppose we pull elements from the first component of the 
result (the (S a)) but not the second component (the (S b)). To provide these 
'a' elements, 'unzip' must pull tuples from its source stream (S (a, b)) and 
buffer the 'b' part until someone pulls from the (S b).
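
A minimal sketch of this buffering, continuing the code above. The result is 
wrapped in IO so the two buffers can be allocated, which differs slightly from 
the pure signature given earlier, and the queues are plain lists, so this is 
for clarity rather than efficiency:

import Data.IORef

unzip :: S (a, b) -> IO (S a, S b)
unzip src = do
  bufA <- newIORef []   -- 'a' parts pulled from src but not yet demanded
  bufB <- newIORef []   -- 'b' parts pulled from src but not yet demanded
  let pullA = do
        as <- readIORef bufA
        case as of
          a : rest -> do writeIORef bufA rest
                         return a
          []       -> do (a, b) <- src
                         modifyIORef bufB (++ [b])  -- buffer the 'b' part
                         return a
      pullB = do
        bs <- readIORef bufB
        case bs of
          b : rest -> do writeIORef bufB rest
                         return b
          []       -> do (a, b) <- src
                         modifyIORef bufA (++ [a])  -- buffer the 'a' part
                         return b
  return (pullA, pullB)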

Dually, with 'cozip', suppose we push elements into the first component of the 
result (the (C a)). The implementation must buffer them until someone pushes 
the corresponding element into the (C b); only then can it push the whole 
tuple into the source (C (a, b)) costream.
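
The corresponding sketch for 'cozip', again with IO-wrapped results and list 
buffers, holds pushed elements until their partners arrive:

cozip :: C (a, b) -> IO (C a, C b)
cozip sink = do
  bufA <- newIORef []
  bufB <- newIORef []
  let pushA a = do
        bs <- readIORef bufB
        case bs of
          b : rest -> do writeIORef bufB rest
                         sink (a, b)
          []       -> modifyIORef bufA (++ [a])  -- wait for a matching 'b'
      pushB b = do
        as <- readIORef bufA
        case as of
          a : rest -> do writeIORef bufA rest
                         sink (a, b)
          []       -> modifyIORef bufB (++ [b])  -- wait for a matching 'a'
  return (pushA, pushB)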


The two combinators unzipc and zipc are hybrids:

For 'unzipc', if we pull an element from the (S a), then the implementation can 
pull a whole (a, b) tuple from the source (S (a, b)) and then get rid of the 
'b' part by pushing it into the (C b). The fact that it can get rid of the 'b' 
part means it doesn't need a buffer.
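
As a sketch:

unzipc :: S (a, b) -> C b -> S a
unzipc src cb = do
  (a, b) <- src
  cb b          -- dispose of the 'b' part immediately: no buffer
  return a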

Similarly, for 'zipc', if we push a 'b' into the (C b), then the implementation 
can pull the corresponding 'a' part from the (S a) and push the whole (a, b) 
tuple into the (C (a, b)). The fact that it can get the corresponding 'a' 
immediately means it doesn't need a buffer.
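
As a sketch:

zipc :: C (a, b) -> S a -> C b
zipc cab sa b = do
  a <- sa       -- fetch the matching 'a' on demand: no buffer
  cab (a, b)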

I've got some hand-drawn diagrams of this if anyone wants them (mail me), but 
none of it helps implement 'unzip' for streams or 'cozip' for costreams. 



> I'll be interested to see in more detail the approach that Ben is
> talking about. As Ben says, intuitively the problem is that when you've
> got multiple outputs so you need to make sure that someone is consuming
> them and that that consumption is appropriately synchronised so that you
> don't have to buffer (buffering would almost certainly eliminate the
> gains from fusion). That might be possible if ultimately the multiple
> outputs are combined again in some way, so that overall you still have a
> single consumer, that can be turned into a single lazy or eager loop.


At least for high performance applications, I think we've reached the limit of 
what short-cut fusion approaches can provide. By "short-cut fusion", I mean 
crafting a special source program so that the inliner + simplifier + 
constructor specialisation transform can crunch down the intermediate code into 
a nice loop. Geoff Mainland's recent paper extended stream fusion with support 
for SIMD operations, but I don't think stream fusion can ever be made to fuse 
programs with unzip/cozip-like operators properly. This is a serious problem 
for DPH, because the DPH vectoriser naturally produces code that contains these 
operators.

I'm currently working on Repa 4, which will include a GHC plugin that hijacks 
the intermediate GHC core code and performs the transformation described in 
Richard Waters's paper "Automatic transformation of series expressions into 
loops". The plugin will apply to stream programs, but will not affect the 
existing fusion mechanism via delayed arrays. I'm using a cut-down 'clock 
calculus' from 
work on synchronous data-flow languages to guarantee that all outputs from an 
unzip operation are consumed in lock-step. Programs that don't do this won't be 
well typed. Forcing synchronicity guarantees that Waters's transform will apply 
to the program.

The Repa plugin will also do proper SIMD vectorisation for stream programs, 
producing the SIMD primops that Geoff recently added. Along the way it will 
brutally convert all operations on boxed/lifted numeric data to their unboxed 
equivalents, because I am sick of adding bang patterns to every single function 
parameter in Repa programs. 

Ben.


