[ https://issues.apache.org/jira/browse/CRUNCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269783#comment-14269783 ]
Josh Wills commented on CRUNCH-485:
-----------------------------------

No, nothing else from you-- I need to write a test for it to verify that it works on the kind of schemas you described, which should be pretty straightforward. If you don't want to wait for 0.12, you can cut your own version against 0.12.0-SNAPSHOT after I check it in. Thanks again Tycho!

> groupByKey on Spark incorrect if key is Avro record with defined sort order
> ---------------------------------------------------------------------------
>
>                 Key: CRUNCH-485
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-485
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.11.0
>            Reporter: Tycho Lamerigts
>            Assignee: Josh Wills
>         Attachments: CRUNCH-485.patch
>
>
> GroupByKey on Spark is incorrect if the key type is an Avro record with
> defined sort order (http://avro.apache.org/docs/1.7.7/spec.html#order).
> Instead, it serializes the entire avro record to a binary blob (byte array)
> and groups identical blobs. This is wrong. By contrast, groupByKey on
> MapReduce works as expected, so it does take Avro's sort order into account.
> The culprit is probably the following code from
> org.apache.crunch.impl.spark.collect.PGroupedTableImpl#getJavaRDDLikeInternal
> {code}
> groupedRDD = parentRDD.map(new PairMapFunction(ptype.getOutputMapFn(),
>         runtime.getRuntimeContext()))
>     .mapToPair(new MapOutputFunction(keySerde, valueSerde))
>     .groupByKey(numPartitions);
> {code}
> where MapOutputFunction simply converts the entire key object to a binary
> blob, without taking sort order into account.
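
To make the reported behaviour concrete, here is a minimal sketch in plain Java. It uses only the standard library, with no Crunch, Spark, or Avro, so it illustrates the difference between the two grouping strategies and is not the actual code path or the fix. Everything in it is hypothetical: the `Key` class, its `timestamp` field (standing in for an Avro field declared with "order": "ignore"), and the two helpers `groupByBlob` and `groupBySortOrder`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortOrderGrouping {
    // Hypothetical stand-in for an Avro record key. `id` takes part in the
    // sort order; `timestamp` plays the role of a field with "order": "ignore".
    static final class Key {
        final int id;
        final long timestamp;

        Key(int id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }

        // Serializes every field, including the one the sort order ignores.
        byte[] toBytes() {
            return ByteBuffer.allocate(12).putInt(id).putLong(timestamp).array();
        }
    }

    // Groups on the full serialized key, which is what the issue describes
    // for the Spark path: only byte-identical blobs end up together.
    static Map<ByteBuffer, List<String>> groupByBlob(List<Map.Entry<Key, String>> pairs) {
        Map<ByteBuffer, List<String>> groups = new HashMap<>();
        for (Map.Entry<Key, String> p : pairs) {
            ByteBuffer blob = ByteBuffer.wrap(p.getKey().toBytes());
            groups.computeIfAbsent(blob, k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Groups with a comparator that skips the ignored field, so keys that
    // compare as equal under the sort order share one group.
    static Map<Key, List<String>> groupBySortOrder(List<Map.Entry<Key, String>> pairs) {
        Map<Key, List<String>> groups = new TreeMap<>(Comparator.comparingInt((Key k) -> k.id));
        for (Map.Entry<Key, String> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Two keys share id 1 but differ in the ignored timestamp.
    static List<Map.Entry<Key, String>> samplePairs() {
        return List.of(
            Map.entry(new Key(1, 100L), "a"),
            Map.entry(new Key(1, 200L), "b"),
            Map.entry(new Key(2, 100L), "c"));
    }

    public static void main(String[] args) {
        List<Map.Entry<Key, String>> pairs = samplePairs();
        System.out.println(groupByBlob(pairs).size());      // prints 3
        System.out.println(groupBySortOrder(pairs).size()); // prints 2
    }
}
```

In this sketch, blob grouping splits the two `id = 1` keys into separate groups because their ignored `timestamp` bytes differ, while comparator-based grouping keeps them together, which matches the MapReduce behaviour the reporter describes as expected.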