[ 
https://issues.apache.org/jira/browse/SPARK-49616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pazeto Jr updated SPARK-49616:
-------------------------------------
    Description: 
When I read JSON payloads, PySpark changes the data even though I read the 
field as StringType. I want the payload as a string because I don't want each 
field as a column at this step; I just want the payload string exactly as it 
appears in the source file.
Locally I'm using Spark 3.3 in a Jupyter Notebook with the Glue 4 image; 
PySpark version: 3.3.0+amzn.1.dev0

Here is my payload/source (test.txt):
{"payload":{"points":1220000000}}
{"payload":{"count":1550554545.0}}
{"payload":{"points":125888002540.0, "count":1550554545.0}}
{"payload":{"name": "Roger", "count":55154111.0}}

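Here is my code:
from pyspark.sql.types import StructType, StructField, StringType

# `spark` is the SparkSession provided by the notebook.
path = "/home/glue_user/workspace/jupyter_workspace/test/test.txt"
schema = StructType([StructField('payload', StringType(), True)])
my_df = spark.read.schema(schema).option("inferSchema", "false").json(path)
my_df.show(truncate=False)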

Here is the result, where PySpark renders the float numbers in scientific 
notation even though I read the field as a string.
+------------------------------------------------+
|payload                                         |
+------------------------------------------------+
|{"points":1220000000}                           |
|{"count":1.550554545E9}                         |
|{"points":1.2588800254E11,"count":1.550554545E9}|
|{"name":"Roger","count":5.5154111E7}            |
+------------------------------------------------+
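
The output above is consistent with each nested floating-point number being parsed into a double and then printed again with Java-style formatting, which switches to scientific notation for magnitudes of 10^7 and above. That this is what Spark's JSON reader does internally is an assumption, not something confirmed in this report. The helper `java_style` below is a hypothetical, plain-Python approximation of that formatting, written only to show that the rule reproduces the values seen in the table.

```python
from decimal import Decimal


def java_style(x: float) -> str:
    """Approximate how Java prints a finite double (hypothetical helper)."""
    # Between 1e-3 (inclusive) and 1e7 (exclusive), and for zero,
    # plain decimal notation is used; Python's repr matches in that range.
    if x == 0 or 1e-3 <= abs(x) < 1e7:
        return repr(x)
    # Otherwise use one leading digit, a fraction, and an 'E' exponent.
    sign, digits, exp = Decimal(repr(x)).normalize().as_tuple()
    exponent = len(digits) - 1 + exp
    fraction = "".join(map(str, digits[1:])) or "0"
    return f"{'-' if sign else ''}{digits[0]}.{fraction}E{exponent}"


for value in (1550554545.0, 125888002540.0, 55154111.0):
    print(value, "->", java_style(value))
```

Integers such as 1220000000 carry no decimal point, so they would not go through this double formatting, which matches the unchanged first row. Infinities and NaN are not handled by this sketch.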

Why can't I simply have my data as it is? Why is the value changed inside my 
string field so that it comes back in scientific notation? For example:
{quote}"count":1550554545.0
"count":1.550554545E9
{quote}
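
One possible workaround, sketched here under assumptions, is to avoid re-serializing the payload at all: read each line as plain text and slice the value of the `payload` key straight out of the source string. The function `raw_field` below is a hypothetical helper, not part of Spark; it uses only Python's standard `json` module to locate where the value starts and ends, and has not been tested inside a Spark job.

```python
import json

_decoder = json.JSONDecoder()


def raw_field(line, key="payload"):
    """Return the exact source text of a top-level key's value, or None."""
    text = line.strip()
    if not text.startswith("{"):
        return None
    pos = 1
    try:
        while True:
            # Skip whitespace and the commas between members.
            while pos < len(text) and text[pos] in " \t\r\n,":
                pos += 1
            if pos >= len(text) or text[pos] == "}":
                return None
            # Decode the member name, then skip the colon.
            name, pos = _decoder.raw_decode(text, pos)
            while pos < len(text) and text[pos] in " \t\r\n:":
                pos += 1
            start = pos
            # Decode the value only to learn where it ends.
            _, pos = _decoder.raw_decode(text, pos)
            if name == key:
                return text[start:pos]
    except ValueError:
        # Malformed JSON: give up rather than guess.
        return None


print(raw_field('{"payload":{"count":1550554545.0}}'))
```

In a notebook, a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to the `value` column returned by `spark.read.text(path)`, so that the number text is carried through unchanged.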

  was:
The file contains the following data:

DAta
1200404151072.121111111111
1200404151073
1200404151074.1232323
1200404151075.124344
1200404151076.12
1200404151077.123433333
1200404151078.12
1200404151079.12544545454554
1251080.1234444444444444444444
10000000000000000000000000000

Spark reads the values in scientific notation, whereas we want to read the data 
exactly as it appears in the file, with an accurate numeric datatype rather 
than as a string.

+--------------------+
|                DAta|
+--------------------+
|1.200404151072121E12|
|   1.200404151073E12|
|1.200404151074123...|
|1.200404151075124...|
| 1.20040415107612E12|
|1.200404151077123...|
| 1.20040415107812E12|
|1.200404151079125...|
|  1251080.1234444445|
|              1.0E28|
+--------------------+

> Spark reading data in scientific notation in String field
> ---------------------------------------------------------
>
>                 Key: SPARK-49616
>                 URL: https://issues.apache.org/jira/browse/SPARK-49616
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.1.0, 3.3.0
>            Reporter: Daniel Pazeto Jr
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

