I created a pull request last night for a new InputSource API that is
essentially a stripped down version of the RDD API for providing data into
Spark. Would be great to hear the community's feedback.

Spark currently has two de facto input source APIs:
1. RDD
2. Hadoop MapReduce InputFormat

Neither of the above is ideal:

1. RDD: It is hard for Java developers to implement RDD, given the implicit
class tags. In addition, the RDD API depends on Scala's runtime library,
which does not preserve binary compatibility across Scala versions. If a
developer chooses Java to implement an input source, it would be great if
that input source could remain binary compatible for years to come.

2. Hadoop InputFormat: The Hadoop InputFormat API is overly restrictive.
For example, it forces key-value semantics, and it does not support running
arbitrary code on the driver side (broadcast variables are one example of
why that is useful). In addition, it is somewhat awkward to tell developers
that in order to implement an input source for Spark, they must first learn
the Hadoop MapReduce API.


My patch creates a new InputSource interface, described by:

- an array of InputPartition that specifies the data partitioning
- a RecordReader that specifies how data on each partition can be read

This interface is similar to Hadoop's InputFormat, except that there is no
explicit key/value separation.
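To make the shape of the proposal concrete, here is a minimal sketch of what such an interface could look like, with a toy in-memory implementation. The names InputSource, InputPartition, and RecordReader come from the description above; the method names (getPartitions, createRecordReader, next, get, close) are illustrative assumptions, not the actual signatures in the pull request.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// One slice of the input data; read in parallel on the executors.
interface InputPartition {
    int index();
}

// Iterator-style reader over the records of a single partition.
// Note: no key/value split -- just a single record type T.
interface RecordReader<T> {
    boolean next();   // advance to the next record; false when exhausted
    T get();          // the current record
    void close();
}

// The input source itself. getPartitions runs on the driver, so it can
// execute arbitrary setup code there; createRecordReader runs per partition.
interface InputSource<T> {
    InputPartition[] getPartitions();
    RecordReader<T> createRecordReader(InputPartition partition);
}

// Toy example: each partition serves one fixed in-memory slice of strings.
class InMemorySource implements InputSource<String> {
    private final List<List<String>> slices;

    InMemorySource(List<List<String>> slices) {
        this.slices = slices;
    }

    @Override
    public InputPartition[] getPartitions() {
        InputPartition[] parts = new InputPartition[slices.size()];
        for (int i = 0; i < parts.length; i++) {
            final int idx = i;
            parts[i] = () -> idx;  // InputPartition is a functional interface here
        }
        return parts;
    }

    @Override
    public RecordReader<String> createRecordReader(InputPartition partition) {
        final Iterator<String> it = slices.get(partition.index()).iterator();
        return new RecordReader<String>() {
            private String current;
            public boolean next() {
                if (it.hasNext()) { current = it.next(); return true; }
                return false;
            }
            public String get() { return current; }
            public void close() { }
        };
    }
}
```

Because there is no forced key/value pair, a source of plain strings (or any other single record type) falls out naturally, and the class-tag and Scala-runtime issues above disappear since everything is plain Java.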


JIRA ticket: https://issues.apache.org/jira/browse/SPARK-7025
Pull request: https://github.com/apache/spark/pull/5603
