> The problem is that the iterator interface only defines 'hasNext' and 'next' methods.
Just a comment from the peanut gallery, but FWIW it seems like being able to ask "how much data is here" would be a useful thing for Spark to know, even if that means moving away from Iterator itself to something like an IteratorWithSizeEstimate.

Not only for this, but so that, ideally, Spark could do dynamic partitioning. E.g. when we load a month's worth of data, it's X GB, but after a few maps and filters, it's X/100 GB, so Spark could use X/100 partitions instead. Right now all partitioning decisions are made up-front, via .coalesce/etc. type hints from the programmer. If Spark could delay making partitioning decisions until each RDD could lazily eval/sample a few lines (hand waving), that would be super sexy from our perspective, in terms of doing automatic perf/partition optimization.

Huge disclaimer that this is probably a big pita to implement, and it could well not be as worthwhile as I naively think it would be.

- Stephen
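For concreteness, here is one hypothetical shape such an interface could take, sketched in Java. The name IteratorWithSizeEstimate and the sizeEstimate method are made up for illustration, not an actual Spark API; the idea is just that a consumer can ask for a cheap, best-effort count of remaining elements before deciding how to partition.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical interface: an Iterator that can also report a cheap,
// best-effort estimate of how many elements remain. Not a real Spark API.
interface IteratorWithSizeEstimate<T> extends Iterator<T> {
    // Returns an estimate of remaining elements, or -1 if unknown.
    long sizeEstimate();
}

// Trivial implementation backed by an in-memory list, where the
// "estimate" happens to be exact.
class ListBackedIterator<T> implements IteratorWithSizeEstimate<T> {
    private final List<T> data;
    private int pos = 0;

    ListBackedIterator(List<T> data) { this.data = data; }

    public boolean hasNext() { return pos < data.size(); }
    public T next() { return data.get(pos++); }
    public long sizeEstimate() { return data.size() - pos; }
}
```

A scheduler could then check `sizeEstimate()` on a post-filter iterator and pick a partition count proportional to the estimate, falling back to today's up-front behavior when the estimate is -1.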