[jira] [Commented] (SPARK-25360) Parallelized RDDs of Ranges could have known partitioner

zhengruifeng (JIRA) Tue, 11 Jun 2019 02:48:44 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860756#comment-16860756
 ]


zhengruifeng commented on SPARK-25360:
--------------------------------------

[~holdenk] I am afraid it is not feasible to add a partitioner to the \{RDD[Long]} 
generated by \{sc.range}, given the definition of Partitioner:
{code:java}
/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 *
 * Note that, partitioner must be deterministic, i.e. it must return the same partition id given
 * the same partition key.
 */{code}
Since the returned RDD[Long] is not a \{PairRDD}, the downstream ops (like 
join, sort) that can utilize an upstream partitioner do not apply to it, so the 
partitioner would go unused.

 

An alternative is to add some method like `sc.tabulate[T](start, end, step, 
numSlices)(f: Long => T)`, so that the partitioner can be used in future ops.
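
To make the idea concrete, here is a rough, Spark-free sketch of what such a partitioner could compute. All names here (RangeSlicePartitioner, sliceStart, the tabulate helper) are hypothetical and not existing Spark API. It assumes slice i starts at element index i * numElements / numSlices, which is my reading of the existing splitting logic, and it assumes the tabulated elements would be keyed by their Long value so that a partitioner has a key to work with. It only handles a non-empty ascending range and does not guard against overflow.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical sketch, not Spark API: maps a Long key of a range to its slice index. */
final class RangeSlicePartitioner {
    private final long start;
    private final long step;
    private final long numElements;
    private final int numSlices;

    RangeSlicePartitioner(long start, long end, long step, int numSlices) {
        if (step <= 0 || numSlices <= 0 || end <= start) {
            throw new IllegalArgumentException("sketch only handles a non-empty ascending range");
        }
        this.start = start;
        this.step = step;
        this.numSlices = numSlices;
        // Number of elements in [start, end) with the given step, rounded up.
        this.numElements = (end - start + step - 1) / step;
    }

    int numPartitions() {
        return numSlices;
    }

    /** Element index at which slice i begins (assumed formula: i * numElements / numSlices). */
    long sliceStart(int i) {
        return (long) i * numElements / numSlices;
    }

    /** Deterministically maps a key of the range to a slice ID in [0, numSlices). */
    int getPartition(long key) {
        long offset = key - start;
        if (offset < 0 || offset % step != 0 || offset / step >= numElements) {
            throw new IllegalArgumentException("key " + key + " is not an element of the range");
        }
        long idx = offset / step;
        // Inverse of sliceStart: the unique i with sliceStart(i) <= idx < sliceStart(i + 1).
        return (int) (((idx + 1) * numSlices - 1) / numElements);
    }

    /** Hypothetical tabulate: builds (key, f(key)) entries, one list per slice. */
    <T> List<List<Map.Entry<Long, T>>> tabulate(LongFunction<T> f) {
        List<List<Map.Entry<Long, T>>> slices = new ArrayList<>();
        for (int i = 0; i < numSlices; i++) {
            List<Map.Entry<Long, T>> slice = new ArrayList<>();
            for (long idx = sliceStart(i); idx < sliceStart(i + 1); idx++) {
                long key = start + idx * step;
                slice.add(new AbstractMap.SimpleImmutableEntry<>(key, f.apply(key)));
            }
            slices.add(slice);
        }
        return slices;
    }
}
```

For example, with start = 0, end = 10, step = 1 and numSlices = 3, the slices hold 3, 3 and 4 elements, and key 6 maps to slice 2. Because every key produced by tabulate lands in the slice that getPartition reports for it, a later op keyed on the same Long values could in principle reuse the layout instead of shuffling.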

> Parallelized RDDs of Ranges could have known partitioner
> --------------------------------------------------------
>
>                 Key: SPARK-25360
>                 URL: https://issues.apache.org/jira/browse/SPARK-25360
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: holdenk
>            Priority: Trivial
>
> We already have the logic to split up the generator, we could expose the same 
> logic as a partitioner. This would be useful when joining a small 
> parallelized collection with a larger collection and other cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
