[ https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
venu k tangirala updated SPARK-14425:
-------------------------------------
    Description: 

I am running this on Databricks Cloud. I am running a join operation and the result has the columns mixed up.

Here is an example. The original df:

{code:none}
>>> df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]
{code}

Since I am building a recommendation system and ALS Ratings require user_id and product_id to be int, I map them as follows:

{code:none}
# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda x: x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId", "int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt")  # doing this only for the SQL

# mapping long to int for products
productId_toInt = f_df.map(lambda x: x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId", "int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt")  # doing this only for the SQL

f_df.registerTempTable("f_df")  # doing this only for the SQL
{code}

Now I do the join to get the int versions of user_id and product_id. I tried both a DataFrame join and a SQL join; both give the same error:

{code:none}
tmp = f_df \
    .join(visitorId_toInt, f_df["visitorId"] == visitorId_toInt["visitorId"], 'inner') \
    .select(f_df["idSite"], f_df["servertimestamp"], visitorId_toInt["int_visitorId"],
            f_df["sessionId"], f_df["serverTimePretty"], f_df["productId"], f_df["pageId"])

ratings_df = tmp \
    .join(productId_toInt, tmp["productId"] == productId_toInt["productId"], 'inner') \
    .select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],
            tmp["sessionId"], tmp["serverTimePretty"], productId_toInt["int_productId"], tmp["pageId"])
{code}

The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, sessionId, serverTimePretty, int_productId, pageId \
    FROM f_df \
    INNER JOIN visitorId_toInt \
    ON f_df.visitorId = visitorId_toInt.visitorId \
    INNER JOIN productId_toInt \
    ON f_df.productId = productId_toInt.productId \
    ")
{code}

Here is the result of the join:

{code:none}
>>> ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that value is being assigned to pageId.
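For reference, a common way to avoid ambiguous column resolution when a DataFrame is joined against a lookup table derived from that same DataFrame is to give the lookup table's key column a name that does not collide with the original. The following is only a minimal sketch under the same setup as above, not a confirmed fix for the behavior reported here; the v_visitorId / v_productId names are invented for illustration:

{code:python}
# Hedged workaround sketch (not from the original report): build the lookup
# tables with key columns whose names do not collide with f_df's, join on the
# renamed keys, then drop them afterwards. The v_* names are hypothetical.
visitorId_toInt = f_df.map(lambda x: x["visitorId"]).distinct() \
    .zipWithUniqueId().toDF(schema=["v_visitorId", "int_visitorId"])
productId_toInt = f_df.map(lambda x: x["productId"]).distinct() \
    .zipWithUniqueId().toDF(schema=["v_productId", "int_productId"])

ratings_df = f_df \
    .join(visitorId_toInt, f_df["visitorId"] == visitorId_toInt["v_visitorId"], 'inner') \
    .join(productId_toInt, f_df["productId"] == productId_toInt["v_productId"], 'inner') \
    .drop("v_visitorId").drop("v_productId")
{code}

If the columns still come out mixed up even with non-colliding names, that would point at an analyzer/planner problem rather than ambiguous column resolution.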
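For context on why the integer mapping is needed at all: MLlib's Rating takes integer user and product ids. A minimal sketch of the intended downstream step, assuming an implicit rating of 1.0 per page view (the rating value and the ALS parameters are assumptions, not taken from the report):

{code:python}
from pyspark.mllib.recommendation import ALS, Rating

# Hypothetical downstream step (assumed, not stated in the report): treat each
# page view as an implicit rating of 1.0; rank/iterations are placeholders.
ratings = ratings_df.map(
    lambda r: Rating(int(r["int_visitorId"]), int(r["int_productId"]), 1.0))
model = ALS.trainImplicit(ratings, rank=10, iterations=10)
{code}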
> spark SQL/dataframe join error: mixes the columns up
> ----------------------------------------------------
>
>                 Key: SPARK-14425
>                 URL: https://issues.apache.org/jira/browse/SPARK-14425
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.6.1
>         Environment: databricks cloud
>            Reporter: venu k tangirala
>            Priority: Blocker
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)