[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

Neil McQuarrie (JIRA) Mon, 14 Aug 2017 13:13:21 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Neil McQuarrie updated SPARK-21727:
-----------------------------------
    Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as ArrayType(Double). 

However, any subsequent operation on this SparkR DataFrame appears to throw an 
error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))}}
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices                                                       data 
1       1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2       2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3       3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4       4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array<double>
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as having type ArrayType(Double). 

However, any subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))}}
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices                                                       data 
1       1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2       2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3       3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4       4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array<double>
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



> Operating on an ArrayType in a SparkR DataFrame throws error
> ------------------------------------------------------------
>
>                 Key: SPARK-21727
>                 URL: https://issues.apache.org/jira/browse/SPARK-21727
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.2.0
>            Reporter: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))}}
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>    ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>    ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>    ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>    ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices                                                       data 
> 1       1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2       2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3       3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4       4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice the list column was successfully 
> converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array<double>
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

Reply via email to