[ 
https://issues.apache.org/jira/browse/SPARK-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3387.
------------------------------
    Resolution: Not a Problem

Your code does not include a call to {{groupBy}}, right? It calls 
{{groupByKey}} and that's part of the list here.
Yes, the actual execution plan does not map directly to the methods the user 
called. Some user-facing API methods are not distributed operations at all; 
others invoke several distributed operations. I think this is as intended.
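
The mismatch can be sketched with a small toy model. This is not Spark's 
scheduler code. It assumes, consistent with the stage list in the report below, 
that a stage boundary falls wherever a {{ShuffledRDD}} appears in the lineage, 
that each upstream stage is labelled with the operator that created its last 
RDD, and that the final stage is labelled with the action. The names 
{{StageNames}} and {{stageNames}} are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: mimics how stage labels could be derived from a lineage.
// It does not call Spark and is not Spark's actual scheduler logic.
class StageNames {

    /**
     * @param lineage {rddType, operator} pairs in toDebugString order
     *                (final RDD first, source RDD last)
     * @param action  the action that triggered the job, e.g. "collect"
     * @return stage labels, result stage first
     */
    static List<String> stageNames(List<String[]> lineage, String action) {
        List<String> names = new ArrayList<>();
        // Assumption: the result stage is labelled with the action.
        names.add(action);
        for (int i = 0; i + 1 < lineage.size(); i++) {
            if (lineage.get(i)[0].equals("ShuffledRDD")) {
                // The RDD just below a shuffle is the last RDD of the
                // upstream stage; that stage takes its operator as a label.
                names.add(lineage.get(i + 1)[1]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Lineage copied from the debug string in the report.
        List<String[]> lineage = Arrays.asList(
            new String[] {"MappedValuesRDD", "groupByKey"},
            new String[] {"MappedValuesRDD", "groupByKey"},
            new String[] {"MapPartitionsRDD", "groupByKey"},
            new String[] {"ShuffledRDD", "groupByKey"},
            new String[] {"MappedRDD", "distinct"},
            new String[] {"MapPartitionsRDD", "distinct"},
            new String[] {"ShuffledRDD", "distinct"},
            new String[] {"MapPartitionsRDD", "distinct"},
            new String[] {"MappedRDD", "distinct"},
            new String[] {"MappedRDD", "mapToPair"},
            new String[] {"MappedRDD", "textFile"},
            new String[] {"HadoopRDD", "textFile"});
        System.out.println(stageNames(lineage, "collect"));
    }
}
```

For the lineage in the report this prints [collect, distinct, distinct]. The 
map side of the {{groupByKey}} shuffle is the last RDD created by 
{{distinct}}, so under these assumptions that stage carries the "distinct" 
label even though its output feeds {{groupByKey}}.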

> Misleading stage description on the driver UI
> ---------------------------------------------
>
>                 Key: SPARK-3387
>                 URL: https://issues.apache.org/jira/browse/SPARK-3387
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 1.0.2
>         Environment: Java 1.6, OSX Mountain Lion
>            Reporter: Christian Chua
>
> Steps to reproduce: compile and run this modified version of the 1.0.2 
> PageRank example:
>     public static void main(String[] args) throws Exception {
>         JavaSparkContext sc = new JavaSparkContext("local[8]", "Sample");
>         JavaRDD<String> inputRDD = sc.textFile(INPUT_FILE, 1);
>         JavaPairRDD<String, String> a = inputRDD.mapToPair(
>                 new PairFunction<String, String, String>() {
>             @Override
>             public Tuple2<String, String> call(String s) throws Exception {
>                 String[] parts = SPACES.split(s);
>                 return new Tuple2<String, String>(parts[0], parts[1]);
>             }
>         });
>         JavaPairRDD<String, String> b = a.distinct();
>         JavaPairRDD<String, Iterable<String>> c = b.groupByKey(11);
>         System.out.println(c.toDebugString());
>         System.out.println(c.collect());
>         JOptionPane.showMessageDialog(null, "Last Line");
>         sc.stop();
>     }
> The debug string appears as:
> MappedValuesRDD[11] at groupByKey at Sample.java:45 (11 partitions)
>   MappedValuesRDD[10] at groupByKey at Sample.java:45 (11 partitions)
>     MapPartitionsRDD[9] at groupByKey at Sample.java:45 (11 partitions)
>       ShuffledRDD[8] at groupByKey at Sample.java:45 (11 partitions)
>         MappedRDD[7] at distinct at Sample.java:41 (1 partitions)
>           MapPartitionsRDD[6] at distinct at Sample.java:41 (1 partitions)
>             ShuffledRDD[5] at distinct at Sample.java:41 (1 partitions)
>               MapPartitionsRDD[4] at distinct at Sample.java:41 (1 partitions)
>                 MappedRDD[3] at distinct at Sample.java:41 (1 partitions)
>                   MappedRDD[2] at mapToPair at Sample.java:30 (1 partitions)
>                     MappedRDD[1] at textFile at Sample.java:28 (1 partitions)
>                       HadoopRDD[0] at textFile at Sample.java:28 (1 partitions)
> The problem is that the "list of stages" in the UI (localhost:4040) does not 
> mention anything about "groupBy". 
> In fact, it mentions "distinct" twice:
> stage 0 : collect
> stage 1 : distinct
> stage 2 : distinct
> This misleading information can significantly confuse a learner.


