liu created SPARK-37325:
---------------------------

             Summary: Result vector from pandas_udf was not the required length
                 Key: SPARK-37325
                 URL: https://issues.apache.org/jira/browse/SPARK-37325
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.2.0
         Environment: 1
            Reporter: liu


{{schema = StructType([}}
{{    StructField("node", StringType())}}
{{])}}
{{rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")}}
{{rdd = rdd.map(lambda obj: {'node': obj})}}
{{df_node = spark.createDataFrame(rdd, schema=schema)}}

{{df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")}}
{{pd_fname = df_fname.select('fname').toPandas()}}

{{@pandas_udf(IntegerType(), PandasUDFType.SCALAR)}}
{{def udf_match(word: pd.Series) -> pd.Series:}}
{{    my_Series = pd_fname.squeeze()    # dataframe to Series}}
{{    num = int(my_Series.str.contains(word.array[0]).sum())}}
{{    return pd.Series(num)}}

{{df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))}}

Hi, I have two DataFrames and tried the approach above, but I get this error:
{{RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1}}
I would be very grateful for any help.

 

PS: I don't think the method itself is the problem. I created some sample
data and verified it successfully, but the error appears when I use the real
data. I checked the data and can't figure it out.

Does anyone know what causes this?
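
One possible reading of the error, offered as a sketch and not a confirmed diagnosis: a scalar pandas UDF is called once per batch of rows and has to return one value per row in that batch. The function above only looks at {{word.array[0]}} and returns a single-element Series, which would be consistent with "expected 100, got 1" for a 100-row batch. The example below uses plain Python lists as a stand-in for pandas Series so that it runs without Spark. All function names and data in it are hypothetical, and it uses literal substring matching where {{str.contains}} treats the pattern as a regex by default.

```python
# Plain-Python stand-in for a scalar pandas UDF: lists play the role of Series.
# Every name and value below is hypothetical and not part of the original report.

def match_first_only(words, fnames):
    """Mirrors the posted UDF: only words[0] is inspected, so one value comes back."""
    return [sum(words[0] in name for name in fnames)]


def match_per_row(words, fnames):
    """Returns one count per input word, so the output length equals the batch length."""
    return [sum(word in name for name in fnames) for word in words]


def check_length(batch, result):
    """Imitates the per-batch length check behind the reported RuntimeError."""
    if len(result) != len(batch):
        raise RuntimeError(
            "Result vector was not the required length: "
            f"expected {len(batch)}, got {len(result)}"
        )
    return result


fnames = ["alpha.txt", "beta.txt", "alphabet.csv"]  # made-up file names
batch = ["alpha", "beta", "csv"]                    # made-up batch of 3 rows

# One count per row passes the check.
print(check_length(batch, match_per_row(batch, fnames)))

# A single value for the whole batch fails it.
try:
    check_length(batch, match_first_only(batch, fnames))
except RuntimeError as err:
    print(err)
```

If this reading is right, the equivalent change in the UDF would be to compute the count for every element of {{word}}, for example with {{word.apply(...)}}, so that the returned Series has the same length as the input. A single-row batch would pass either way, which may be why a small sample did not show the problem.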



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
