[
https://issues.apache.org/jira/browse/SPARK-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-3387.
------------------------------
Resolution: Not a Problem
Your code does not include a call to {{groupBy}}, right? It calls
{{groupByKey}} and that's part of the list here.
Yes, the actual execution plan does not map directly to what the user called.
Some user-facing API methods are not distributed operations at all; some invoke
several different distributed operations. I think this is as-intended.
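
To make the mapping concrete, here is a minimal sketch in plain Java (no Spark dependency) of how the stage labels in the report could arise. It is a simplified model, not Spark's scheduler code. The labeling rule is an assumption: each stage that feeds a shuffle takes the call site of the last RDD before that shuffle, and the final stage takes the name of the action. The class StageNames, its Node type, and its methods are hypothetical names used only for this illustration. The lineage is copied from the debug string quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StageNames {
    /** One lineage entry: the RDD class and the user-facing method that created it. */
    public static final class Node {
        final String rddType;
        final String callSite;

        Node(String rddType, String callSite) {
            this.rddType = rddType;
            this.callSite = callSite;
        }
    }

    /** Lineage from the reported toDebugString output, ordered from source to result. */
    public static List<Node> sampleLineage() {
        return Arrays.asList(
            new Node("HadoopRDD", "textFile"),
            new Node("MappedRDD", "textFile"),
            new Node("MappedRDD", "mapToPair"),
            new Node("MappedRDD", "distinct"),
            new Node("MapPartitionsRDD", "distinct"),
            new Node("ShuffledRDD", "distinct"),
            new Node("MapPartitionsRDD", "distinct"),
            new Node("MappedRDD", "distinct"),
            new Node("ShuffledRDD", "groupByKey"),
            new Node("MapPartitionsRDD", "groupByKey"),
            new Node("MappedValuesRDD", "groupByKey"),
            new Node("MappedValuesRDD", "groupByKey"));
    }

    /**
     * Assumed labeling rule (a simplification): the final stage is labeled with
     * the action, and every stage that feeds a shuffle is labeled with the call
     * site of the last RDD before that shuffle. The final stage is listed first,
     * followed by the upstream stages in source-to-result order.
     */
    public static List<String> stageNames(List<Node> lineage, String action) {
        List<String> names = new ArrayList<String>();
        names.add(action);
        for (int i = 1; i < lineage.size(); i++) {
            // A ShuffledRDD marks a stage boundary; the RDD just before it
            // is the last one computed on the map side of that shuffle.
            if (lineage.get(i).rddType.equals("ShuffledRDD")) {
                names.add(lineage.get(i - 1).callSite);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [collect, distinct, distinct]
        System.out.println(stageNames(sampleLineage(), "collect"));
    }
}
```

Under this model the shuffle introduced by {{groupByKey}} is fed by an RDD that {{distinct}} created, so the upstream stage carries the "distinct" label even though the shuffle itself belongs to {{groupByKey}}.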
> Misleading stage description on the driver UI
> ---------------------------------------------
>
> Key: SPARK-3387
> URL: https://issues.apache.org/jira/browse/SPARK-3387
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.0.2
> Environment: Java 1.6, OSX Mountain Lion
> Reporter: Christian Chua
>
> Steps to reproduce: compile and run this modified version of the 1.0.2
> PageRank example:
> public static void main(String[] args) throws Exception {
>     JavaSparkContext sc = new JavaSparkContext("local[8]", "Sample");
>     JavaRDD<String> inputRDD = sc.textFile(INPUT_FILE, 1);
>     JavaPairRDD<String, String> a = inputRDD.mapToPair(
>         new PairFunction<String, String, String>() {
>             @Override
>             public Tuple2<String, String> call(String s) throws Exception {
>                 String[] parts = SPACES.split(s);
>                 return new Tuple2<String, String>(parts[0], parts[1]);
>             }
>         });
>     JavaPairRDD<String, String> b = a.distinct();
>     JavaPairRDD<String, Iterable<String>> c = b.groupByKey(11);
>     System.out.println(c.toDebugString());
>     System.out.println(c.collect());
>     JOptionPane.showMessageDialog(null, "Last Line");
>     sc.stop();
> }
> The debug string appears as:
> MappedValuesRDD[11] at groupByKey at Sample.java:45 (11 partitions)
> MappedValuesRDD[10] at groupByKey at Sample.java:45 (11 partitions)
> MapPartitionsRDD[9] at groupByKey at Sample.java:45 (11 partitions)
> ShuffledRDD[8] at groupByKey at Sample.java:45 (11 partitions)
> MappedRDD[7] at distinct at Sample.java:41 (1 partitions)
> MapPartitionsRDD[6] at distinct at Sample.java:41 (1 partitions)
> ShuffledRDD[5] at distinct at Sample.java:41 (1 partitions)
> MapPartitionsRDD[4] at distinct at Sample.java:41 (1 partitions)
> MappedRDD[3] at distinct at Sample.java:41 (1 partitions)
> MappedRDD[2] at mapToPair at Sample.java:30 (1 partitions)
> MappedRDD[1] at textFile at Sample.java:28 (1 partitions)
> HadoopRDD[0] at textFile at Sample.java:28 (1 partitions)
> The problem is that the "list of stages" in the UI (localhost:4040) does not
> mention anything about "groupBy".
> In fact, it mentions "distinct" twice:
> stage 0 : collect
> stage 1 : distinct
> stage 2 : distinct
> This piece of misleading information can confuse the learner significantly.