[ https://issues.apache.org/jira/browse/SPARK-7804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555154#comment-14555154 ]
Josh Rosen commented on SPARK-7804:
-----------------------------------

(I'm a bit new to Spark SQL internals, so please forgive me if this is off-base.) I think that all-caps {{JDBCRDD}} is an internal SQL class that is not designed to be used by end users; it is marked {{private\[sql]}} in the code (https://github.com/apache/spark/blob/v1.3.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L208). Spark SQL's internal RDDs have {{compute()}} methods that return iterators which hand back the same mutable row object on each {{next()}} call. When you call {{cache()}} on such an RDD, you end up with a cached RDD that contains the same mutable row object repeated many times, which leads to the duplicate records you're seeing here. In a nutshell, I don't think the example given here is valid, because it uses an internal Spark SQL class in a way that it does not support. If you want to load data from JDBC and access it as an RDD, I think the right way to do this is to use {{SQLContext.load}} to load the data from JDBC into a DataFrame, then call {{rdd}} (or {{toJavaRDD}} from Java) on the resulting DataFrame. See https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases for more details.
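The mutable-row reuse described above can be demonstrated without Spark at all. A minimal plain-Java sketch (class names are mine, purely illustrative) of an iterator that hands back one shared buffer on every {{next()}} call, which is the shape of iterator the internal row scans produce:

```java
import java.util.*;

// Illustrative stand-in for an internal row iterator: it fills and
// returns ONE shared buffer on every next() call instead of
// allocating a fresh row object.
class ReusingIterator implements Iterator<String[]> {
    private final String[][] source = {
        {"12", "john"}, {"23", "tom"}, {"34", "tony"}
    };
    private final String[] row = new String[2]; // single shared buffer
    private int i = 0;

    public boolean hasNext() { return i < source.length; }

    public String[] next() {
        row[0] = source[i][0];
        row[1] = source[i][1];
        i++;
        return row; // same object every time
    }
}

public class MutableRowPitfall {
    public static void main(String[] args) {
        // Materializing the iterator into a list (which is roughly what
        // cache()/collect() do) stores three references to the SAME
        // buffer, which by then holds the last row read.
        List<String[]> cached = new ArrayList<>();
        Iterator<String[]> it = new ReusingIterator();
        while (it.hasNext()) cached.add(it.next());
        for (String[] r : cached) System.out.println(r[0] + " " + r[1]);
        // prints "34 tony" three times
    }
}
```

All three cached entries alias one buffer, so the last row read ("34 tony") appears three times, matching the symptom in this report.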
> Incorrect results from JDBCRDD -- one record repeated
> -----------------------------------------------------
>
>                 Key: SPARK-7804
>                 URL: https://issues.apache.org/jira/browse/SPARK-7804
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Paul Wu
>              Labels: JDBCRDD, sql
>
> Getting only one record, repeated, in the RDD, with repeated field values.
>
> I have a table like:
> {code}
> attuid  name  email
> 12      john  j...@appp.com
> 23      tom   t...@appp.com
> 34      tony  t...@appp.com
> {code}
> My code:
> {code}
> JavaSparkContext sc = new JavaSparkContext(sparkConf);
> String url = "....";
> java.util.Properties prop = new Properties();
> List<JDBCPartition> partitionList = new ArrayList<>();
> partitionList.add(new JDBCPartition("1=1", 0));
>
> List<StructField> fields = new ArrayList<StructField>();
> fields.add(DataTypes.createStructField("attuid", DataTypes.StringType, true));
> fields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
> fields.add(DataTypes.createStructField("email", DataTypes.StringType, true));
> StructType schema = DataTypes.createStructType(fields);
>
> JDBCRDD jdbcRDD = new JDBCRDD(sc.sc(),
>         JDBCRDD.getConnector("oracle.jdbc.OracleDriver", url, prop),
>         schema,
>         " USERS",
>         new String[]{"attuid", "name", "email"},
>         new Filter[]{},
>         partitionList.toArray(new JDBCPartition[0]));
>
> System.out.println("count before to Java RDD=" + jdbcRDD.cache().count());
> JavaRDD<Row> jrdd = jdbcRDD.toJavaRDD();
> System.out.println("count=" + jrdd.count());
> List<Row> lr = jrdd.collect();
> for (Row r : lr) {
>     for (int ii = 0; ii < r.length(); ii++) {
>         System.out.println(r.getString(ii));
>     }
> }
> {code}
> ===========================
> The result is:
> {code}
> 34
> tony
> t...@appp.com
> 34
> tony
> t...@appp.com
> 34
> tony
> t...@appp.com
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
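Independent of the recommended {{SQLContext.load}} route, the generic fix for any buffer-reusing iterator is a defensive copy of each element before it is stored. A plain-Java sketch (names are illustrative, not Spark API) of what copying each row before caching achieves:

```java
import java.util.*;

public class DefensiveCopyDemo {
    // Consume a buffer-reusing source, copying each row before storing it,
    // so cached entries no longer alias one shared buffer.
    static List<String[]> cacheWithCopies(String[][] source) {
        String[] shared = new String[2];   // one buffer reused per "next()"
        List<String[]> cached = new ArrayList<>();
        for (String[] row : source) {
            shared[0] = row[0];
            shared[1] = row[1];
            cached.add(shared.clone());    // defensive copy before caching
        }
        return cached;
    }

    public static void main(String[] args) {
        String[][] source = {{"12", "john"}, {"23", "tom"}, {"34", "tony"}};
        for (String[] r : cacheWithCopies(source)) {
            System.out.println(r[0] + " " + r[1]);
        }
        // prints "12 john", "23 tom", "34 tony"
    }
}
```

With the copy in place, each cached entry is an independent snapshot, so all three distinct rows survive materialization instead of collapsing to the last one read.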