Yin Huai created SPARK-12662:
--------------------------------

    Summary: Add document to randomSplit to explain the sampling depends on the ordering of the rows in a partition
    Key: SPARK-12662
    URL: https://issues.apache.org/jira/browse/SPARK-12662
    Project: Spark
    Issue Type: Bug
    Components: Documentation, SQL
    Reporter: Yin Huai

With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following code returns overlapping rows in the two DataFrames produced by randomSplit.

{code}
sqlContext.sql("drop table if exists test")

val x = sc.parallelize(1 to 210)

case class R(ID: Int)

// Write 210 distinct IDs to the table queried below.
sqlContext.createDataFrame(x.map { R(_) }).write.format("json").saveAsTable("test")

var df = sql("select distinct ID from test")

var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)

a.registerTempTable("a")
b.registerTempTable("b")

// The two splits should be disjoint, yet this intersection is not empty.
val intersectDF = a.intersect(b)
intersectDF.show
{code}

The reason is that {{sql("select distinct ID from test")}} does not guarantee the ordering of rows within a partition. It would be good to add more documentation to the API doc to explain this. For intersectDF to contain 0 rows, df needs to have a fixed row ordering within each partition.
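
To illustrate why row ordering matters, here is a minimal sketch in plain Scala (no Spark required). It assumes a simplified model of the sampling: each split re-reads the partition, draws one random number per row in iteration order from a generator created with the same seed, and keeps the rows whose draw falls inside that split's weight range. The names RandomSplitOrdering, sampleCell, and split are hypothetical and exist only for this illustration; this is not Spark's actual implementation.

```scala
import scala.util.Random

object RandomSplitOrdering {
  // Keep the rows whose random draw falls in [lower, upper). A fresh generator
  // is created from `seed` on every call, so the draw a row receives depends
  // only on the seed and on the row's position in `partition`.
  def sampleCell(partition: Seq[Int], lower: Double, upper: Double, seed: Long): Seq[Int] = {
    val rng = new Random(seed)
    partition.filter { _ =>
      val draw = rng.nextDouble()
      draw >= lower && draw < upper
    }
  }

  // Mimic a two-way split (assumed model): each half re-reads the partition
  // independently and may see the rows in a different order.
  def split(orderSeenByA: Seq[Int], orderSeenByB: Seq[Int], seed: Long): (Seq[Int], Seq[Int]) = {
    val a = sampleCell(orderSeenByA, 0.0, 0.333, seed)
    val b = sampleCell(orderSeenByB, 0.333, 1.0, seed)
    (a, b)
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 210).toVector

    // Same ordering for both halves: every row gets the same draw in both
    // evaluations, so the halves are disjoint and together cover every row.
    val (a1, b1) = split(rows, rows, 1234L)
    println(s"stable order:   overlap = ${a1.intersect(b1).size}, missing = ${rows.diff(a1 ++ b1).size}")

    // Different ordering for the second half: a row can get a different draw
    // in each evaluation, so rows may appear in both halves or in neither.
    val (a2, b2) = split(rows, rows.reverse, 1234L)
    println(s"unstable order: overlap = ${a2.intersect(b2).size}, missing = ${rows.diff(a2 ++ b2).size}")

    // Sorting inside each evaluation restores a fixed ordering.
    val (a3, b3) = split(rows.sorted, rows.reverse.sorted, 1234L)
    println(s"sorted first:   overlap = ${a3.intersect(b3).size}, missing = ${rows.diff(a3 ++ b3).size}")
  }
}
```

Under this model, the stable and sorted cases always report zero overlap and zero missing rows, while in the unstable case the number of overlapping rows always equals the number of missing rows. That matches the behaviour described above: the splits are only guaranteed to be disjoint when both evaluations see the rows of a partition in the same order.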