Re: [Factor-talk] Playing with python-like iterators (and efficiency of continuations)

Paul Moore Sat, 31 Oct 2009 16:28:06 -0700

2009/10/31 Daniel Ehrenberg <[email protected]>:
> 1. Exceptions would be a slow (in the current implementation) and
> messy way to signal the end of an iteration. Exceptions should be used
> for things that aren't likely to happen, and this is something likely.


That's what I thought. Exceptions are cheap in Python, so the usage
makes more sense there.

> You could base your design for next on the assoc protocol. at* looks
> up the value corresponding to a key in an assoc. If the key is
> present, it returns the value along with the flag t. If the key is
> absent, it returns f for the value and f for the flag. Often, code
> will call the word 'at' rather than 'at*', which is : at at* drop
> ;--just returning the value, or f if the value isn't there. But this
> only works if you expect that at* shouldn't return f for the value, or
> if you want to equate a value of f with a missing key, or you know the
> key will be there, etc. So you could have next* and next.

Yes, that sounds reasonable. As you say below, next should be a
low-level function, and not one the user would normally call directly.

> Alternatively, you could have a function empty? which tests if the
> iterator has been exhausted, together with a next-unsafe function that
> increments the iterator and is allowed to create an invalid iterator
> if it's already done. Out of these primitives, you could build next
> (or next*) to safely increment the iterator, either throwing an
> exception if it's done (if we don't think anyone's likely to call next
> on purpose on an exhausted iterator) or return a flag about whether
> it's done (if we think people are likely to call it on purpose in that
> case).

Testing for emptiness is the difficulty. As an example, consider an
iterator which repeatedly returns the result of calling a function,
until a sentinel value is returned (for a silly, but concrete example,
consider generating a random sequence of numbers in the range 0-59
until a number greater than 50 is returned). It's not possible to tell
if the iterator is "empty" without generating the next number, and if
you do that, you need to cache the generated value. That quickly gets
messy - it's how C++ input iterators work, and they are a pain to
implement...

So I really want to avoid allowing the user to check for emptiness
before requesting the value (again from Python, "it's better to ask
for forgiveness than to ask for permission"). The assoc protocol model
sounds reasonable.

> If you go with a generic protocol for iterators (which I would
> recommend for performance reasons; I wrote about this below), then you
> should choose the base functions by what's easiest to implement, and
> then expose a set of functions on top of that which turn out to be the
> most useful. Probably, the situation where you call 'next' directly at
> all is rare, compared to situations where you want to call 'each'.

Precisely.

> 2. Using coroutines for this is very costly and probably will always
> be more expensive than writing code in a direct style. Saving and
> restoring contexts has some overhead, and it's harder for the compiler
> to reason about it. How would you be able to inline the computation,
> even if the coroutine is known? There's probably a way, but I don't
> know of any compiler that does this. It must have a cost in Python
> too, but maybe the penalty isn't so great because your Python code
> will already be slow for other reasons.

That's what I thought, but I hadn't even considered inlining and
optimisation - as Python is interpreted, considerations like that
don't apply. It sounds like avoiding over-engineering is the right
thing to do here...

> In general:
> I like your idea of making a set of generic words for iterators. There
> could be several classes that implement the iterator words: sequences,
> coroutine-based generators, linked lists, deques, hashtables, etc.
> This would be a bit of code to implement each of them, a little more
> than if you went with coroutines because you need an explicit
> representation of the cursor. But if a system like that existed, you
> could write code that iterates over any type of object. You could
> implement 'each', for example, in terms of iterators, and if 'like'
> were part of the protocol, you could implement 'map'.
>
> It would be great to replace the sequence protocol with this. Then,
> many of the sequence words wouldn't be tied to operating on
> random-access structures. Code duplication in the standard library
> could be reduced.

That was what started me thinking about this. I want to do some fairly
generic manipulation of streams of random numbers. My Python
experience led me straight away to thinking in terms of an infinite
iterator stream, but Factor's sequence protocol isn't suitable as
sequences have to know their length. That's a shame as it means that
map and each can't be used on lazy or infinite streams (hence those
vocabularies defining lmap and equivalents).

> The big challenge is doing this without performance regressions. The
> code should be expressed in a way that the compiler can optimize out
> the overhead of using iterator objects rather that doing the iteration
> directly. This is impossible for coroutines (at least with the
> compiler we have right now), but it could be possible for iterators
> over sequences, if enough type information is known by the compiler.

Phew, that's more of a challenge than I was considering - I know
little about the subtleties of the compiler and optimizer, so it never
occurred to me to think in terms like this.

Is there a test suite which could be used to ensure that such a change
didn't introduce performance regressions? The individual vocabulary
unit tests don't really seem to be designed to cover this sort of
thing - and I don't see a way of running all the vocab unit tests as a
group to do a full test, in any case.

> Maybe you want to make these iterators functional. That is, maybe the
> next operation should return a new iterator that stands for the new
> location, without modifying the old iterator object. One advantage of
> this is that the compiler understands how to do escape analysis on
> immutable tuples. This is the only way I could see iterators being
> implemented that wouldn't be a performance regression for sequence
> operations. A disadvantage is that, if escape analysis can't be done,
> it allocates more memory.

Interesting food for thought (as in, it barely makes sense to me at
the moment, so I'll have some fun doing some research to understand
the issues you're talking about :-)) One possible disadvantage of
making them functional would be that in effect you're allowing cloning
of iterators: dup next would result in 2 iterators, which "point to"
successive items. If you consider my "iterate over successive return
values of a function" example, to do this would need arbitrary amounts
of intermediate storage - dup next next next ... (N times) needs to
store N results, so that the original iterator (second in the stack at
this point) can return the same values at a later point (again, assume
a true random number generator or a user input reader for examples of
why you'd need to store results rather than recalculate them).

Thanks for the detailed and interesting comments.

Paul.

------------------------------------------------------------------------------
Come build with us! The BlackBerry(R) Developer Conference in SF, CA
is the only developer event you need to attend this year. Jumpstart your
developing skills, take BlackBerry mobile applications to market and stay 
ahead of the curve. Join us from November 9 - 12, 2009. Register now!
http://p.sf.net/sfu/devconference
_______________________________________________
Factor-talk mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/factor-talk

Re: [Factor-talk] Playing with python-like iterators (and efficiency of continuations)

Reply via email to