liu created SPARK-37325:
---------------------------
Summary: Result vector from pandas_udf was not the required length
Key: SPARK-37325
URL: https://issues.apache.org/jira/browse/SPARK-37325
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.2.0
Environment: 1
Reporter: liu
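{{import pandas as pd}}
{{from pyspark.sql.functions import pandas_udf, PandasUDFType}}
{{from pyspark.sql.types import IntegerType, StringType, StructField, StructType}}
{{}}
{{schema = StructType([
StructField("node", StringType())
])}}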
{{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}}
{{rdd = rdd.map(lambda obj: {'node': obj})}}
{{df_node = spark.createDataFrame(rdd, schema=schema)}}
{{}}
{{}}
{{df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()}}
{{}}
{{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}}
{{def udf_match(word: pd.Series) -> pd.Series:}}
{{ my_Series = pd_fname.squeeze() # dataframe to Series}}
{{ num = int(my_Series.str.contains(word.array[0]).sum())}}
{{ return pd.Series(num)}}
{{}}
{{}}
{{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}}
Hi, I have two DataFrames and tried the method above, but I get this error:
{{RuntimeError: Result vector from pandas_udf was not the required length:
expected 100, got 1}}
Any help would be greatly appreciated.
PS: I don't think the method itself is the problem. I created some sample
data and it ran successfully, but the error appears when I use the real data.
I checked the data and can't figure out what is wrong.
Does anyone know what might cause this?
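
A possible explanation, offered as a sketch rather than a confirmed diagnosis: a scalar pandas_udf receives a whole batch of rows as one pd.Series and is expected to return a pd.Series of the same length. The function above reads only word.array[0] and returns a one-element Series, which would be consistent with "expected 100, got 1" if the batch held 100 rows, and would still pass on sample data that happens to arrive one row per batch. The pandas-only example below computes one count per input word. The names count_matches, fnames and batch are hypothetical, and regex=False (literal substring matching) is an assumption that differs from the regex matching str.contains does by default.

```python
import pandas as pd

# Hypothetical stand-in for pd_fname.squeeze() from the report.
fnames = pd.Series(["alpha_node", "beta", "alphabet", "gamma"])


def count_matches(words: pd.Series, names: pd.Series) -> pd.Series:
    """Return one match count per input word, aligned with the input batch."""
    # Count, for each word in the batch, how many names contain it.
    # regex=False treats the word as a literal substring (an assumption).
    counts = words.map(lambda w: int(names.str.contains(w, regex=False).sum()))
    return counts.astype("int32")


# Hypothetical batch of three input rows; the result also has three rows.
batch = pd.Series(["alpha", "beta", "zeta"])
result = count_matches(batch, fnames)

# Inside Spark, the same function could be wrapped like this (untested here):
#   @pandas_udf(IntegerType())
#   def udf_match(word: pd.Series) -> pd.Series:
#       return count_matches(word, pd_fname.squeeze())
```

The key point is that the returned Series has len(result) == len(batch), regardless of how many rows Spark places in each batch.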