[
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-14425.
----------------------------------
Resolution: Incomplete
> SQL/dataframe join error: mixes up the columns
> -----------------------------------------------
>
> Key: SPARK-14425
> URL: https://issues.apache.org/jira/browse/SPARK-14425
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.6.1
> Environment: databricks cloud
> Reporter: venu k tangirala
> Priority: Major
> Labels: bulk-closed
>
> I am running this on databricks cloud.
> I am running a join operation, and the result has the columns mixed up.
> Here is an example:
> the original df:
> {code:none}
> >>>f_df.take(3)
> [Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5',
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM',
> productId=374695023, pageId=2617232),
> Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5',
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM',
> productId=374694787, pageId=2617240),
> Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5',
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM',
> productId=374694787, pageId=2617247)]
> {code}
> As I am trying to build a recommendation system, and ALS Ratings require
> user_id and product_id to be int, I am mapping them as follows:
> {code:none}
> # mapping string to int for visitors
> visitorId_toInt = f_df.map(lambda x: x["visitorId"]) \
>     .distinct().zipWithUniqueId() \
>     .toDF(schema=["visitorId", "int_visitorId"])
> # print visitorId_toInt.take(3)
> visitorId_toInt.registerTempTable("visitorId_toInt")  # doing this only for SQL
> # mapping long to int for products
> productId_toInt = f_df.map(lambda x: x["productId"]) \
>     .distinct().zipWithUniqueId() \
>     .toDF(schema=["productId", "int_productId"])
> # print productId_toInt.take(3)
> productId_toInt.registerTempTable("productId_toInt")  # doing this only for SQL
> f_df.registerTempTable("f_df")  # doing this only for SQL
> {code}
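For reference, the distinct-value-to-unique-int mapping above can be sketched without Spark. This is a plain-Python stand-in with hypothetical sample rows; `enumerate` plays the role of `zipWithUniqueId`, which in Spark assigns ids per partition rather than sequentially:

```python
# Plain-Python sketch of the distinct -> unique-int mapping (no Spark).
# The sample rows are hypothetical stand-ins for f_df.
rows = [
    {"visitorId": "8391f66992d536e5", "productId": 374695023},
    {"visitorId": "8391f66992d536e5", "productId": 374694787},
    {"visitorId": "aabbccddeeff0011", "productId": 374694787},
]

def to_int_mapping(rows, key):
    """Map each distinct value of `key` to a unique small int."""
    distinct = sorted({r[key] for r in rows})  # sorted only for determinism
    return {value: i for i, value in enumerate(distinct)}

visitor_to_int = to_int_mapping(rows, "visitorId")
product_to_int = to_int_mapping(rows, "productId")
```

Each distinct visitorId and productId gets exactly one integer code, which is the invariant the ALS preprocessing depends on.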
> Now I do the join to get the int versions of user_id and product_id as
> follows. I tried it with both the DataFrame join and the SQL join; both show
> the same error:
> {code:none}
> tmp = f_df \
>     .join(visitorId_toInt, f_df["visitorId"] == visitorId_toInt["visitorId"], 'inner') \
>     .select(f_df["idSite"], f_df["servertimestamp"], visitorId_toInt["int_visitorId"],
>             f_df["sessionId"], f_df["serverTimePretty"], f_df["productId"], f_df["pageId"])
>
> ratings_df = tmp \
>     .join(productId_toInt, tmp["productId"] == productId_toInt["productId"], 'inner') \
>     .select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],
>             tmp["sessionId"], tmp["serverTimePretty"], productId_toInt["int_productId"], tmp["pageId"])
> {code}
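The intended result of these two inner joins can be simulated in plain Python, as a reference for what each output row should look like (data and int mappings here are hypothetical stand-ins; dict lookups play the role of the joins):

```python
# Plain-Python simulation of the two inner joins: each row keeps its own
# idSite/sessionId/etc. and only gains the int codes. Data are hypothetical.
rows = [
    {"idSite": 100, "servertimestamp": 1455219299,
     "visitorId": "8391f66992d536e5", "sessionId": 725873,
     "productId": 374695023, "pageId": 2617232},
    {"idSite": 100, "servertimestamp": 1455219299,
     "visitorId": "8391f66992d536e5", "sessionId": 725873,
     "productId": 374694787, "pageId": 2617240},
]
visitor_to_int = {"8391f66992d536e5": 5774}
product_to_int = {374695023: 7936, 374694787: 1060}

# Inner join semantics: keep a row only when both lookup keys match.
ratings = [
    {"idSite": r["idSite"], "servertimestamp": r["servertimestamp"],
     "int_visitorId": visitor_to_int[r["visitorId"]],
     "sessionId": r["sessionId"],
     "int_productId": product_to_int[r["productId"]],
     "pageId": r["pageId"]}
    for r in rows
    if r["visitorId"] in visitor_to_int and r["productId"] in product_to_int
]
```

In this reference version, idSite stays 100 and pageId keeps its original value in every row, which is what the DataFrame and SQL joins above should also produce.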
> The SQL version:
> {code:none}
> ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, \
>     sessionId, serverTimePretty, int_productId, pageId \
>     FROM f_df \
>     INNER JOIN visitorId_toInt ON f_df.visitorId = visitorId_toInt.visitorId \
>     INNER JOIN productId_toInt ON f_df.productId = productId_toInt.productId")
> {code}
> Here is the result of the join:
> {code:none}
> >>>ratings_df.take(3)
> [Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774,
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936,
> pageId=100),
> Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774,
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936,
> pageId=100),
> Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271,
> sessionId=None, serverTimePretty=u'375989339', int_productId=1060,
> pageId=100)]
> {code}
> The idSite in my dataset is 100 for all rows, and somehow that value is being
> assigned to pageId.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)