[
https://issues.apache.org/jira/browse/SPARK-35814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366086#comment-17366086
]
Hyukjin Kwon commented on SPARK-35814:
--------------------------------------
It's because the return type doesn't match:
{code}
dapply(): expected IntegerType, got DoubleType
{code}
Check your df against the given returnSchema:
{code}
> spark_df
SparkDataFrame[Sepal.Length:int, Sepal.Width:double, Petal.Length:double,
Petal.Width:double, Species:string]
> returnSchema
StructType
|-name = "Sep_add_one", type = "IntegerType", nullable = TRUE
{code}
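There are two ways to resolve the mismatch: declare the column as double (what R arithmetic actually produces), or return an integer from the function. A sketch of both, assuming the same spark_df and session as in the report (untested here):
{code}
# Fix 1: make the declared type match what the function returns (double)
returnSchema <- structType(structField("Sep_add_one", "double"))
df <- SparkR::dapply(spark_df, function(rdf) {
  data.frame(rdf$Sepal.Length + 1)
}, returnSchema)

# Fix 2: keep "int" in the schema and return an integer from the function
returnSchema <- structType(structField("Sep_add_one", "int"))
df <- SparkR::dapply(spark_df, function(rdf) {
  data.frame(rdf$Sepal.Length + 1L)  # 1L keeps the arithmetic integer
}, returnSchema)
{code}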
> Mismatched types when creating a new column when using Arrow
> ------------------------------------------------------------
>
> Key: SPARK-35814
> URL: https://issues.apache.org/jira/browse/SPARK-35814
> Project: Spark
> Issue Type: Bug
> Components: R
> Affects Versions: 3.1.2
> Reporter: Nic Crane
> Priority: Minor
>
> I was looking into a question on StackOverflow where a user had an error due
> to a mismatch of field types
> (https://stackoverflow.com/questions/68032084/how-to-fix-error-in-readbin-using-arrow-package-with-sparkr/).
> I've made a reproducible example below.
>
> {code:java}
> library(SparkR)
> library(arrow)
> SparkR::sparkR.session(sparkConfig =
> list(spark.sql.execution.arrow.sparkr.enabled = "true"))
> readr::write_csv(iris, "iris.csv")
> dfSchema <- structType(
> structField("Sepal.Length", "int"),
> structField("Sepal.Width", "double"),
> structField("Petal.Length", "double"),
> structField("Petal.Width", "double"),
> structField("Species", "string")
> )
> spark_df <- SparkR::read.df(path="iris.csv", source="csv", schema=dfSchema)
> # Apply an R native function to each partition.
> returnSchema <- structType(structField("Sep_add_one", "int"))
> df <- SparkR::dapply(spark_df, function(rdf){
> data.frame(rdf$Sepal.Length + 1)
> }, returnSchema
> )
> collect(df)
> {code}
> It results in this error:
> {code:java}
> Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
> invalid 'n' argument
> {code}
> I've never used SparkR before, so could be wrong, but I *think* it may be due
> to the use of this line in DataFrame.R:
> {code:java}
> arrowTable <- arrow::read_ipc_stream(readRaw(conn))
> {code}
> readRaw assumes that the value is an integer; however, the applied function
> returns a double even though returnSchema declares an integer (the code runs
> without error if it's updated to `rdf$Sepal.Length + 1L`).
> I wasn't sure whether readTypedObject needs to be used here, or whether some
> type checking might be useful.
> Apologies for the partial explanation - not entirely sure of the cause.
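> The promotion itself is plain R behaviour: adding a double literal to an
> integer vector yields a double, which is presumably why the column declared
> as "int" arrives as DoubleType. A minimal base-R illustration:
> {code:java}
> x <- 1:3          # integer vector, like the Sepal.Length column read as "int"
> typeof(x + 1)     # "double"  - the literal 1 is a double, so the result is promoted
> typeof(x + 1L)    # "integer" - 1L keeps the arithmetic integer
> {code}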
--
This message was sent by Atlassian Jira
(v8.3.4#803005)