[ https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kevin Ushey updated SPARK-17752:
--------------------------------

Description:

Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 installation as necessary):

{code:r}
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = "2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl)  # works fine without this
cl <- collect(tbl)
identical(df, cl)  # FALSE
{code}

Although this is reproducible with SparkR, it seems more likely that the error lies in the Java / Scala Spark sources.

For posterity:

{code}
> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)
{code}

was: (the same description, previously wrapped in {{ }} rather than a {code:r} block)

> Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-17752
>                 URL: https://issues.apache.org/jira/browse/SPARK-17752
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kevin Ushey
>            Priority: Critical
>
> (full description as above)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
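A single {{identical(df, cl)}} check only tells you *that* the collected result differs, not *which* of the 1,000 columns came back wrong. One way to narrow such a mismatch down is to compare the two results column by column. The sketch below is a hypothetical, language-agnostic illustration of that diagnostic idea in plain Python (it operates on simple column dicts, not on Spark or SparkR objects, and is not part of the original report):

```python
# Hypothetical diagnostic sketch: given the original data and the data
# collected back from Spark as {column_name: [values]} dicts, report
# which columns differ. This mimics comparing `df` and `cl` column by
# column in the R repro above.

def differing_columns(original, collected):
    """Return a sorted list of column names whose values differ,
    including columns present in only one of the two inputs."""
    names = set(original) | set(collected)
    return sorted(
        name for name in names
        if original.get(name) != collected.get(name)
    )

if __name__ == "__main__":
    # Toy stand-ins for the 1,000-column repro: one corrupted column.
    df = {"X%d" % i: [1] for i in range(1, 6)}
    cl = dict(df)
    cl["X3"] = [0]  # simulate a column coming back wrong after cache()/collect()
    print(differing_columns(df, cl))  # -> ['X3']
```

The equivalent check in R would be something along the lines of iterating over {{names(df)}} and comparing {{df[[n]]}} against {{cl[[n]]}} with {{identical()}}; reporting the offending column names (rather than just {{FALSE}}) can make a wide-schema bug like this much easier for maintainers to localize.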