[ https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
venu k tangirala updated SPARK-14425:
-------------------------------------
    Description: 

I am running this on Databricks Cloud. I am running a join operation and the result has the columns mixed up.

Here is an example. The original df:

{code:none}
>>> df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]
{code}

Since I am building a recommendation system and ALS Ratings require user_id and product_id to be int, I map them as follows:

{code:none}
# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda x: x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId", "int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt")  # doing this only for the SQL

# mapping long to int for products
productId_toInt = f_df.map(lambda x: x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId", "int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt")  # doing this only for the SQL

f_df.registerTempTable("f_df")  # doing this only for the SQL
{code}

Now I do the join to get the int versions of user_id and product_id. I tried both a DataFrame join and a SQL join; both give the same error:

{code:none}
tmp = f_df \
    .join(visitorId_toInt, f_df["visitorId"] == visitorId_toInt["visitorId"], 'inner') \
    .select(f_df["idSite"], f_df["servertimestamp"], visitorId_toInt["int_visitorId"],
            f_df["sessionId"], f_df["serverTimePretty"], f_df["productId"], f_df["pageId"])

ratings_df = tmp \
    .join(productId_toInt, tmp["productId"] == productId_toInt["productId"], 'inner') \
    .select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],
            tmp["sessionId"], tmp["serverTimePretty"], productId_toInt["int_productId"], tmp["pageId"])
{code}

The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, sessionId, serverTimePretty, int_productId, pageId \
    FROM f_df \
    INNER JOIN visitorId_toInt \
    ON f_df.visitorId = visitorId_toInt.visitorId \
    INNER JOIN productId_toInt \
    ON f_df.productId = productId_toInt.productId \
    ")
{code}

Here is the result of the join:

{code:none}
>>> ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that value is being assigned to pageId.
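For reference, a common way to avoid ambiguous column resolution when a DataFrame is joined against a lookup table derived from that same DataFrame is to give the lookup table's key column a name that does not collide with the original. The following is only a minimal sketch under the same setup as above, not a confirmed fix for the behavior reported here; the v_visitorId / v_productId names are invented for illustration:

{code:python}
# Hedged workaround sketch (not from the original report): build the lookup
# tables with key columns whose names do not collide with f_df's, join on the
# renamed keys, then drop them afterwards. The v_* names are hypothetical.
visitorId_toInt = f_df.map(lambda x: x["visitorId"]).distinct() \
    .zipWithUniqueId().toDF(schema=["v_visitorId", "int_visitorId"])
productId_toInt = f_df.map(lambda x: x["productId"]).distinct() \
    .zipWithUniqueId().toDF(schema=["v_productId", "int_productId"])

ratings_df = f_df \
    .join(visitorId_toInt, f_df["visitorId"] == visitorId_toInt["v_visitorId"], 'inner') \
    .join(productId_toInt, f_df["productId"] == productId_toInt["v_productId"], 'inner') \
    .drop("v_visitorId").drop("v_productId")
{code}

If the columns still come out mixed up even with non-colliding names, that would point at an analyzer/planner problem rather than ambiguous column resolution.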
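For context on why the integer mapping is needed at all: MLlib's Rating takes integer user and product ids. A minimal sketch of the intended downstream step, assuming an implicit rating of 1.0 per page view (the rating value and the ALS parameters are assumptions, not taken from the report):

{code:python}
from pyspark.mllib.recommendation import ALS, Rating

# Hypothetical downstream step (assumed, not stated in the report): treat each
# page view as an implicit rating of 1.0; rank/iterations are placeholders.
ratings = ratings_df.map(
    lambda r: Rating(int(r["int_visitorId"]), int(r["int_productId"]), 1.0))
model = ALS.trainImplicit(ratings, rank=10, iterations=10)
{code}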
> spark SQL/dataframe join error: mixes the columns up
> ----------------------------------------------------
>
>                 Key: SPARK-14425
>                 URL: https://issues.apache.org/jira/browse/SPARK-14425
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.6.1
>         Environment: databricks cloud
>            Reporter: venu k tangirala
>            Priority: Blocker
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)