[
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444972#comment-17444972
]
liu edited comment on SPARK-37325 at 11/17/21, 7:14 AM:
--------------------------------------------------------
[~hyukjin.kwon] I don't think that will work. To make the question clear: I have
two dataframes, one Spark DataFrame and one pandas DataFrame. I want to use
{code:java}
@pandas_udf{code}
because it is faster, but it fails with an error I cannot explain.
I can work around it with
{code:java}
@udf{code}
as below, but it is slow:
{code:java}
@udf(returnType=IntegerType())
def udf_match1(word):
    my_Series = pd_fname.squeeze()  # dataframe to Series
    num = my_Series.str.contains(word).sum()
    return int(num){code}
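For reference, a scalar pandas_udf receives a whole batch of input values as a pd.Series and must return a Series of the same length as that batch, which is what the "expected 100, got 1" error is complaining about. A minimal sketch of the batch logic in plain pandas (the pd_fname contents and the match_batch name here are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for pd_fname from the report: a one-column DataFrame.
pd_fname = pd.DataFrame({"fname": ["alpha", "beta", "alphabet"]})

def match_batch(words: pd.Series) -> pd.Series:
    """Body of a scalar pandas_udf: it receives a batch of input values
    and must return a Series of the SAME length as that batch."""
    my_series = pd_fname.squeeze()  # one-column DataFrame to Series
    # One count per input word, so the output length matches the input length.
    return words.apply(lambda w: int(my_series.str.contains(w).sum()))

# A batch of 3 input words yields 3 counts, not 1.
batch = pd.Series(["alpha", "beta", "gamma"])
counts = match_batch(batch)
```

Returning `pd.Series(num)` collapses the whole batch to a single value, which only happens to work when a batch contains one row, as in a small sample dataset.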
> Result vector from pandas_udf was not the required length
> ---------------------------------------------------------
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Environment: 1
> Reporter: liu
> Priority: Major
>
>
> {code:java}
> schema = StructType([
>     StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>     my_Series = pd_fname.squeeze()  # dataframe to Series
>     num = int(my_Series.str.contains(word.array[0]).sum())
>     return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> {code}
>
> Hi, I have two dataframes, and when I try the method above I get this error:
> {code:java}
> RuntimeError: Result vector from pandas_udf was not the required length:
> expected 100, got 1{code}
> I would be really thankful for any help.
>
> PS: I don't think the method itself is the problem: I created sample data of
> the same shape and verified it successfully, but the error appears with the
> real data. I checked the data and can't figure it out.
> Does anyone see what might cause it?
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]