[jira] [Commented] (SPARK-14534) Should SparkContext.parallelize(List) take an Iterable instead?

Sean Owen (JIRA) Mon, 11 Apr 2016 06:54:12 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-14534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235112#comment-15235112
 ]


Sean Owen commented on SPARK-14534:
-----------------------------------

It sounds like it makes sense, though looking at the internal implementation, 
it relies fairly heavily on knowing the size of the input. That's not a 
deal-breaker. This is a key API, so changing the signature has to happen in a 
major release, but 2.x is coming. If the change is source-compatible because 
it merely weakens the type of the argument, that's a lot better.
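
To illustrate why knowing the size matters, here is a minimal sketch in plain 
Java (no Spark dependency). It is not Spark's actual code; the class name 
SliceSketch and the method slice are hypothetical, and the boundary formula is 
only assumed to resemble what the internal slicing does. The point is that the 
slice boundaries are computed from the length up front, which a bare Iterable 
cannot provide without being traversed first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Spark code base.
final class SliceSketch {
    // Split data into numSlices contiguous ranges of near-equal size.
    // The boundaries depend on data.size(), so the size must be known up front.
    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        if (numSlices < 1) {
            throw new IllegalArgumentException("numSlices must be at least 1");
        }
        long length = data.size();
        List<List<T>> slices = new ArrayList<>(numSlices);
        for (int i = 0; i < numSlices; i++) {
            // Integer arithmetic spreads any remainder across the slices.
            int start = (int) ((i * length) / numSlices);
            int end = (int) (((i + 1) * length) / numSlices);
            slices.add(new ArrayList<>(data.subList(start, end)));
        }
        return slices;
    }
}
```

With five elements and two slices, this yields [1, 2] and [3, 4, 5]. An 
Iterable-based variant would have to either buffer the whole input to learn 
its length or use a different partitioning scheme.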

I suppose the assumption is that if a data set is not small, then it's best to 
load the data set in a distributed way and not send it all from the driver. 
That is, if your data set is so big that this is an issue, you don't want to 
parallelize. Instead, executors could do the loading themselves, given an RDD 
of keys or something similar. I think this is why the signature kind of 
"enforces" that; it may be more on purpose than not.
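
A rough sketch of the "RDD of keys" idea, again in plain Java so that it runs 
without Spark. A parallel stream stands in for the executors, and the names 
KeyedLoadSketch, loadByKeys, and loadRows are hypothetical. In Spark the 
analogous shape would presumably be to parallelize the small collection of 
keys and then flatMap each key to its rows on the executors.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical illustration only; a parallel stream simulates the executors.
final class KeyedLoadSketch {
    // Only the small list of keys exists up front. Each worker expands the
    // keys it handles into rows by calling loadRows, so the full data set is
    // never materialized in one place before the work is distributed.
    static <K, R> List<R> loadByKeys(List<K> keys, Function<K, Iterable<R>> loadRows) {
        return keys.parallelStream()
                .flatMap(key -> StreamSupport.stream(loadRows.apply(key).spliterator(), false))
                .collect(Collectors.toList());
    }
}
```

Here loadRows stands for whatever per-key query the data store supports, for 
example fetching the documents in one key range. The driver only needs to 
know the keys, not the rows.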

> Should SparkContext.parallelize(List) take an Iterable instead?
> ---------------------------------------------------------------
>
>                 Key: SPARK-14534
>                 URL: https://issues.apache.org/jira/browse/SPARK-14534
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: David Wood
>            Priority: Minor
>
> I am using MongoDB to read the DB and it provides an Iterable (and not a 
> List) to access the results.  This is similar to the ResultSet in SQL and is 
> done this way so that you can process things row by row and not have to pull 
> in a potentially large DB all at once.  It might be nice if parallelize(List) 
> could instead operate on an Iterable to allow a similar efficiency.  Since a 
> List is an Iterable, this would be backwards compatible.  However, I'm 
> new to Spark, so I'm not sure if that might violate some other design point.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

