[ 
https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171682#comment-17171682
 ] 

Jayce Jiang edited comment on SPARK-32515 at 8/5/20, 6:45 PM:
--------------------------------------------------------------

Okay, I am trying to get all distinct values of the "username" column of this 
df.

The expected result is what I show with 
filter_df.toPandas()["username"].unique().

In that result, all usernames are in the correct format: the username 
column contains only the characters [a-z], [A-Z], [0-9], and the underscore, 
for example "danielrainge" and "dgreen_14".

!unknown.png|width=631,height=251!

What I am actually getting:

The problem appears when I use Spark functions instead of converting to a 
pandas DataFrame first. As you can see in the image, in In [134], when I call 
the collect() method, I get results like [["#classic" |#classic" ]] and random 
results with brackets []. Those results shouldn't be there: none of the strings 
in the username column contain brackets or hashtags (#). 

!unknown1.png|width=576,height=272!
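
A minimal sketch of how the unexpected values could be isolated, assuming the 
rule stated above (only letters, digits, and the underscore are valid). The 
names VALID_USERNAME and split_by_validity and the sample list are hypothetical 
and used only for illustration; the sample mimics the kind of values described 
for the collect() output, it is not the real data.

```python
import re

# Assumption: a valid username contains only letters, digits, and underscores.
VALID_USERNAME = re.compile(r"[A-Za-z0-9_]+")


def split_by_validity(values):
    """Partition values into (valid, invalid) lists, preserving input order."""
    valid, invalid = [], []
    for value in values:
        # Non-string values (for example None from a null cell) count as invalid.
        if isinstance(value, str) and VALID_USERNAME.fullmatch(value):
            valid.append(value)
        else:
            invalid.append(value)
    return valid, invalid


# Hypothetical sample: two well-formed usernames plus stray values.
sample = ["danielrainge", "dgreen_14", '#classic"', "[]", None]
valid, invalid = split_by_validity(sample)
print("valid:", valid)
print("invalid:", invalid)
```

With a real DataFrame, the input list could be built with something like 
[row["username"] for row in filter_df.select("username").distinct().collect()], 
which would make it possible to compare the invalid values from Spark against 
the pandas result.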

I am trying it in Google Colab right now to see whether it is a Jupyter 
notebook problem. Will keep you updated. 

 



> Distinct Function Weird Bug
> ---------------------------
>
>                 Key: SPARK-32515
>                 URL: https://issues.apache.org/jira/browse/SPARK-32515
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6
>         Environment: Windows 10 and Mac; both have the same issue.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>            Reporter: Jayce Jiang
>            Priority: Major
>         Attachments: Capture.PNG, Capture1.png, Capture2.PNG, 
> image-2020-08-03-07-03-55-716.png, unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. I was loading my CSV file into 
> Spark and trying to check all distinct values of a column in a DataFrame. 
> Everything I tried in Spark resulted in a wrong answer, but if I convert my 
> Spark DataFrame into a pandas DataFrame, it works. Please help. This bug 
> only happens with this one CSV file; all my other CSV files work properly. 
> Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!


