[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733175#comment-14733175
 ] 

Ryan Schmitt commented on SPARK-3369:
-------------------------------------

I've found a reasonably elegant workaround for this issue (which was, at first, 
enormously confusing and worrisome, since it seemed like it would force 
buffering instead of streaming). Since Iterable is a SAM type (single abstract 
method) in Java 8, you can simply convert an Iterator<T> into an Iterable<T> 
using a lambda expression:

{code:java}
public Iterable<MyClass> call(Iterator<MyClass> inputIterator) {
  Iterator<MyClass> myReturnValue = ...;
  return () -> myReturnValue;
}
{code}

Interoperation with Java 8 Streams is a bit less convenient, but still totally 
doable:

{code:java}
public Iterable<MyClass> call(Iterator<MyClass> input) {
  final int characteristics = Spliterator.NONNULL | Spliterator.ORDERED;
  Spliterator<MyClass> spliterator = Spliterators.spliteratorUnknownSize(input, 
characteristics);
  Stream<MyClass> inputStream = StreamSupport.stream(spliterator, false);
  Iterator<MyClass> myReturnValue = 
inputStream.map(this::method).filter(that::method).iterator();
  return () -> myReturnValue;
}
{code}

It's also helpful to use AbstractIterator (from Guava) to create iterators; 
it's almost as easy as the yield statement in Python.

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-3369
>                 URL: https://issues.apache.org/jira/browse/SPARK-3369
>             Project: Spark
>          Issue Type: Improvement
>          Components: Java API
>    Affects Versions: 1.0.2, 1.2.1
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>              Labels: breaking_change
>         Attachments: FlatMapIterator.patch
>
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
> an {{Iterator}} but is requires to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistent 
> though because this seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarity for other {{mapPartitions*}} methods and other 
> {{*FlatMapFunctions}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was inadvertent inconsistency, then the big issue here 
> is that of course this is part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to