[
https://issues.apache.org/jira/browse/SPARK-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060668#comment-14060668
]
Hans Uhlig edited comment on SPARK-2278 at 7/14/14 2:14 PM:
------------------------------------------------------------
So I can see two places where this becomes painful quickly.
First, maps, while cheap, are not free; they also functionally describe a data
mutation rather than a transformation modifier. This might seem like a small
syntactic nuance, but it can make processing large datasets painful.
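For concreteness, here is a minimal sketch of the kind of map I mean, assuming
a hypothetical (who, what, where, when) composite key with Tuple4/Tuple3
standing in for real key classes; its only job is to rebuild the key in a new
shape, copying data the old key already carried, once per record:
{code:java}
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;
import scala.Tuple3;
import scala.Tuple4;

final class ReKey {
  // Drop the "when" field from the key so the RDD can be grouped by
  // (who, what, where). Every record pays for a rebuilt key.
  static <V> JavaPairRDD<Tuple3<String, String, String>, V> dropWhen(
      JavaPairRDD<Tuple4<String, String, String, Long>, V> rdd) {
    return rdd.mapToPair(
        new PairFunction<Tuple2<Tuple4<String, String, String, Long>, V>,
                         Tuple3<String, String, String>, V>() {
          @Override
          public Tuple2<Tuple3<String, String, String>, V> call(
              Tuple2<Tuple4<String, String, String, Long>, V> t) {
            Tuple4<String, String, String, Long> k = t._1();
            // Copy three of the four fields into a brand-new key object.
            return new Tuple2<Tuple3<String, String, String>, V>(
                new Tuple3<String, String, String>(k._1(), k._2(), k._3()),
                t._2());
          }
        });
  }
}
{code}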
Second, handling composite keys. Something Spark seems to ignore almost
everywhere is that keys contain data too; there is no need to replicate your
key into your value field for a couple hundred billion records. I don't want
to have six separate copies of a Key class representing who/what/where/when
just because I need to sort them or group them in different orders. I often
sort by the natural order: who, what, where, and when. I then group by a
lesser order: who, what, where. I shouldn't need to create an entirely new Key
class just to change the ordering.
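As a sketch of what "one key class, many orders" could look like (EventKey and
both comparators are hypothetical; the serializable-comparator helper is there
because anything shipped to executors must be Serializable):
{code:java}
import java.io.Serializable;
import java.util.Comparator;

public class EventKey implements Serializable {
  public final String who, what, where;
  public final long when;

  public EventKey(String who, String what, String where, long when) {
    this.who = who; this.what = what; this.where = where; this.when = when;
  }

  // Comparator alone is not Serializable; Spark closures need both.
  abstract static class SerializableComparator<T>
      implements Comparator<T>, Serializable {}

  // Natural order: who, what, where, when.
  public static final Comparator<EventKey> NATURAL =
      new SerializableComparator<EventKey>() {
        public int compare(EventKey a, EventKey b) {
          int c = a.who.compareTo(b.who);
          if (c == 0) c = a.what.compareTo(b.what);
          if (c == 0) c = a.where.compareTo(b.where);
          if (c == 0) c = Long.compare(a.when, b.when);
          return c;
        }
      };

  // Lesser order: who, what, where. Keys differing only in `when` compare
  // equal, so a comparator-aware groupBy would put them in one group.
  public static final Comparator<EventKey> LESSER =
      new SerializableComparator<EventKey>() {
        public int compare(EventKey a, EventKey b) {
          int c = a.who.compareTo(b.who);
          if (c == 0) c = a.what.compareTo(b.what);
          if (c == 0) c = a.where.compareTo(b.where);
          return c;
        }
      };
}
{code}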
I can see something like this:
JavaRDD<T> JavaRDD.sortBy(Comparator<T> comp, Partitioner partitioner, int
numPartitions)
JavaPairRDD<K,V> JavaPairRDD.sortByKey(Comparator<K> comp, Partitioner
partitioner, int numPartitions)
JavaPairRDD<K,Iterable<T>> JavaRDD.groupBy(Function<T,K> func, Comparator<K>
comp, Partitioner partitioner, int numPartitions)
JavaPairRDD<K,Iterable<V>> JavaPairRDD.groupByKey(Comparator<K> comp,
Partitioner partitioner, int numPartitions)
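Call sites against those overloads might then look like the following.
Illustrative only: none of these overloads exist today, and EventKey with its
NATURAL/LESSER comparators is the hypothetical key class sketched above.
{code:java}
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;

final class Proposed {
  static void example(JavaPairRDD<EventKey, String> events) {
    // Sort by the full natural order...
    JavaPairRDD<EventKey, String> sorted =
        events.sortByKey(EventKey.NATURAL, new HashPartitioner(64), 64);

    // ...then group by the lesser order with the SAME key class: records
    // whose keys compare equal under LESSER land in one group, no re-keying.
    JavaPairRDD<EventKey, Iterable<String>> grouped =
        events.groupByKey(EventKey.LESSER, new HashPartitioner(64), 64);
  }
}
{code}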
Also, what is the rationale for none of the reduction functions (reduceBy,
groupBy, etc.) receiving the key of the data they are reducing?
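For contrast, a minimal sketch of the two APIs side by side: Hadoop MapReduce
hands the reducer its key as a parameter, while the Function2 passed to
Spark's reduceByKey sees only the two values being combined.
{code:java}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.spark.api.java.function.Function2;

// Hadoop: the key is in scope, so reduce logic can depend on it.
class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) sum += v.get();
    ctx.write(key, new LongWritable(sum));
  }
}

// Spark: the reduceByKey combiner only ever sees two values; the key is
// never in scope.
class Combine implements Function2<Long, Long, Long> {
  public Long call(Long a, Long b) { return a + b; }
}
{code}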
> groupBy & groupByKey should support custom comparator
> -----------------------------------------------------
>
> Key: SPARK-2278
> URL: https://issues.apache.org/jira/browse/SPARK-2278
> Project: Spark
> Issue Type: New Feature
> Components: Java API
> Affects Versions: 1.0.0
> Reporter: Hans Uhlig
>
> To maintain parity with MapReduce you should be able to specify a custom key
> equality function in groupBy/groupByKey similar to sortByKey.