[ 
https://issues.apache.org/jira/browse/SPARK-35814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35814.
----------------------------------
    Resolution: Invalid

> Mismatched types when creating a new column when using Arrow
> ------------------------------------------------------------
>
>                 Key: SPARK-35814
>                 URL: https://issues.apache.org/jira/browse/SPARK-35814
>             Project: Spark
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 3.1.2
>            Reporter: Nic Crane
>            Priority: Minor
>
> I was looking into a question on StackOverflow where a user had an error due 
> to a mismatch of field types 
> (https://stackoverflow.com/questions/68032084/how-to-fix-error-in-readbin-using-arrow-package-with-sparkr/). 
> I've made a reproducible example below.
>  
> {code:java}
> library(SparkR)
> library(arrow)
> SparkR::sparkR.session(sparkConfig = 
> list(spark.sql.execution.arrow.sparkr.enabled = "true"))
> readr::write_csv(iris, "iris.csv")
> dfSchema <- structType(
>   structField("Sepal.Length", "int"),
>   structField("Sepal.Width", "double"),
>   structField("Petal.Length", "double"),
>   structField("Petal.Width", "double"),
>   structField("Species", "string")
> )
> spark_df <- SparkR::read.df(path="iris.csv", source="csv", schema=dfSchema)
> # Apply an R native function to each partition.
> returnSchema <-  structType(structField("Sep_add_one", "int"))
> df <- SparkR::dapply(spark_df, function(rdf){
>   data.frame(rdf$Sepal.Length + 1)
>   }, returnSchema
> )
> collect(df)
> {code}
> It results in this error:
> {code:java}
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : 
>   invalid 'n' argument
> {code}
> I've never used SparkR before, so could be wrong, but I *think* it may be due 
> to the use of this line in DataFrame.R:
> {code:java}
> arrowTable <- arrow::read_ipc_stream(readRaw(conn))
> {code}
> readRaw assumes that the value is an integer; however, when the result 
> matching returnSchema is created, the R code produces a double (the code 
> runs without error if it's updated to `rdf$Sepal.Length + 1L`).
> I wasn't sure whether readTypedObject should be used instead, or whether 
> some type checking would be useful.
> Apologies for the partial explanation - not entirely sure of the cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
