[Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}

Olivier Girardot Fri, 17 Apr 2015 06:08:29 -0700

Hi everyone,
I had an issue trying to use Spark SQL from Java (8 or 7), I tried to
reproduce it in a small test case close to the actual documentation
<https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection>,
so sorry for the long mail, but this is "Java" :


import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Movie implements Serializable {
    private int id;
    private String name;

    public Movie(int id, String name) {
        this.id = id;
        this.name = name;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}

public class SparkSQLTest {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.setAppName("My Application");
        conf.setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        ArrayList<Movie> movieArrayList = new ArrayList<Movie>();
        movieArrayList.add(new Movie(1, "Indiana Jones"));

        JavaRDD<Movie> movies = sc.parallelize(movieArrayList);

        SQLContext sqlContext = new SQLContext(sc);
        DataFrame frame = sqlContext.applySchema(movies, Movie.class);
        frame.registerTempTable("movies");

        sqlContext.sql("select name from movies")

*                .map(row -> row.getString(0)) // this is what i would
expect to work *                .collect();
    }
}


But this does not compile, here's the compilation error :

[ERROR]
/Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/MainSQL.java:[37,47]
method map in class org.apache.spark.sql.DataFrame cannot be applied to
given types;
[ERROR] *required:
scala.Function1<org.apache.spark.sql.Row,R>,scala.reflect.ClassTag<R> *
[ERROR]* found: (row)->"Na[...]ng(0) *
[ERROR] *reason: cannot infer type-variable(s) R *
[ERROR] *(actual and formal argument lists differ in length) *
[ERROR]
/Users/ogirardot/Documents/spark/java-project/src/main/java/org/apache/spark/SampleSHit.java:[56,17]
method map in class org.apache.spark.sql.DataFrame cannot be applied to
given types;
[ERROR] required:
scala.Function1<org.apache.spark.sql.Row,R>,scala.reflect.ClassTag<R>
[ERROR] found: (row)->row[...]ng(0)
[ERROR] reason: cannot infer type-variable(s) R
[ERROR] (actual and formal argument lists differ in length)
[ERROR] -> [Help 1]

Because in the DataFrame the *map *method is defined as :

[image: Images intégrées 1]

And once this is translated to bytecode the actual Java signature uses a
Function1 and adds a ClassTag parameter.
I can try to go around this and use the scala.reflect.ClassTag$ like that :

ClassTag$.MODULE$.apply(String.class)

To get the second ClassTag parameter right, but then instantiating a
java.util.Function or using the Java 8 lambdas fail to work, and if I
try to instantiate a proper scala Function1... well this is a world of
pain.

This is a regression introduced by the 1.3.x DataFrame because
JavaSchemaRDD used to be JavaRDDLike but DataFrame's are not (and are
not callable with JFunctions), I can open a Jira if you want ?

Regards,

-- 
*Olivier Girardot* | Associé
[email protected]
+33 6 24 09 17 94

[Spark SQL] Java map/flatMap api broken with DataFrame in 1.3.{0,1}

Reply via email to