[ 
https://issues.apache.org/jira/browse/SPARK-19895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904951#comment-15904951
 ] 

Sean Owen commented on SPARK-19895:
-----------------------------------

It's not clear what the problem is. What result do you expect, what do you see, 
and what have you found when debugging the problem? Which step doesn't give the 
result you expect?

> Spark SQL could not output a correct result
> -------------------------------------------
>
>                 Key: SPARK-19895
>                 URL: https://issues.apache.org/jira/browse/SPARK-19895
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Bin Wu
>            Priority: Minor
>              Labels: beginner
>
> I'm rewriting the PageRank algorithm with Spark SQL. The code is as follows:
> from pyspark.sql.functions import *
> from pyspark.sql import SparkSession
> spark = SparkSession \
>         .builder \
>         .appName("Python Spark SQL basic example") \
>         .config("spark.some.config.option", "some-value") \
>         .getOrCreate()
> numOfIterations = 5
>
> lines = spark.read.text("pagerank_data.txt")
> a = lines.select(split(lines[0],' '))
> links = a.select(a[0][0].alias('src'), a[0][1].alias('dst'))
> outdegrees = links.groupBy('src').count()
> ranks = outdegrees.select('src', lit(1).alias('rank'))
> for iteration in range(numOfIterations):
>     contribs = links.join(ranks, 'src').join(outdegrees, 'src').select(
>         'dst', (ranks['rank']/outdegrees['count']).alias('contrib'))
>     #ranks = contribs.groupBy('dst').sum('contrib').select(column('dst').alias('src'), (column('sum(contrib)')*0.85+0.15).alias('rank'))
>     ranks = contribs.withColumnRenamed('dst','dst').groupBy('dst').sum('contrib').select(
>         column('dst').alias('src'), (column('sum(contrib)')*0.85+0.15).alias('rank'))
> ranks.orderBy(desc('rank')).show()
> pagerank_data.txt only has several edges:
> 1 2
> 1 3
> 1 4
> 2 1
> 3 1
> 4 1
> For this small graph, it outputs the correct rank for each node only if I 
> use "withColumnRenamed". On a large data set, however, the line without 
> withColumnRenamed works correctly.
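
One way to pin down the expected result for the quoted graph is to run the same update rule (0.85 * sum of contributions + 0.15, five iterations, every source node starting at rank 1) in plain Python, without Spark. The sketch below is only an illustration under those assumptions: `reference_pagerank` and `EDGES` are hypothetical names that do not appear in the report, and dropping nodes that receive no contribution is an assumption meant to mirror the inner joins in the quoted code.

```python
from collections import defaultdict

# Edges from pagerank_data.txt as quoted in the report, one (src, dst) pair per line.
EDGES = [("1", "2"), ("1", "3"), ("1", "4"), ("2", "1"), ("3", "1"), ("4", "1")]


def reference_pagerank(edges, num_iterations=5):
    """Hypothetical helper: the reported update rule in plain Python, no Spark."""
    # Out-degree per source node, like links.groupBy('src').count().
    outdegree = defaultdict(int)
    for src, _ in edges:
        outdegree[src] += 1

    # Like the report, every node with outgoing edges starts at rank 1.
    ranks = {src: 1.0 for src in outdegree}

    for _ in range(num_iterations):
        contribs = defaultdict(float)
        for src, dst in edges:
            # Assumed inner-join behaviour: a source without a rank contributes nothing.
            if src in ranks:
                contribs[dst] += ranks[src] / outdegree[src]
        # Only nodes that received a contribution keep a rank in the next round.
        ranks = {dst: 0.85 * total + 0.15 for dst, total in contribs.items()}
    return ranks


if __name__ == "__main__":
    # Print nodes by descending rank, like ranks.orderBy(desc('rank')).show().
    for node, rank in sorted(reference_pagerank(EDGES).items(), key=lambda kv: -kv[1]):
        print(node, round(rank, 6))
```

Comparing these values with the output of `ranks.orderBy(desc('rank')).show()` for each variant of the `ranks = ...` line would show which variant, if either, diverges from the intended computation.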


