[ https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kevin Ushey updated SPARK-17752:
--------------------------------

Description:

Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):

{code:r}
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl)  # works fine without this
cl <- collect(tbl)
identical(df, cl)  # FALSE
{code}

Although this is reproducible with SparkR, it seems more likely that the error lies in the Java / Scala Spark sources.

For posterity:

{code}
> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)
{code}

was: (the same description, previously wrapped in {{ }} rather than a {code:r} block)

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17752
>                 URL: https://issues.apache.org/jira/browse/SPARK-17752
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kevin Ushey
>            Priority: Critical
>
> (full description as above)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
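A single {{identical(df, cl)}} check only tells you *that* the collected result differs, not *which* of the 1,000 columns came back wrong. One way to narrow such a mismatch down is to compare the two results column by column. The sketch below is a hypothetical, language-agnostic illustration of that diagnostic idea in plain Python (it operates on simple column dicts, not on Spark or SparkR objects, and is not part of the original report):

```python
# Hypothetical diagnostic sketch: given the original data and the data
# collected back from Spark as {column_name: [values]} dicts, report
# which columns differ. This mimics comparing `df` and `cl` column by
# column in the R repro above.

def differing_columns(original, collected):
    """Return a sorted list of column names whose values differ,
    including columns present in only one of the two inputs."""
    names = set(original) | set(collected)
    return sorted(
        name for name in names
        if original.get(name) != collected.get(name)
    )

if __name__ == "__main__":
    # Toy stand-ins for the 1,000-column repro: one corrupted column.
    df = {"X%d" % i: [1] for i in range(1, 6)}
    cl = dict(df)
    cl["X3"] = [0]  # simulate a column coming back wrong after cache()/collect()
    print(differing_columns(df, cl))  # -> ['X3']
```

The equivalent check in R would be something along the lines of iterating over {{names(df)}} and comparing {{df[[n]]}} against {{cl[[n]]}} with {{identical()}}; reporting the offending column names (rather than just {{FALSE}}) can make a wide-schema bug like this much easier for maintainers to localize.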