[
https://issues.apache.org/jira/browse/SPARK-49616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Daniel Pazeto Jr updated SPARK-49616:
-------------------------------------
Description:
When I read JSON payloads, PySpark changes the data even when I read the field
as StringType. I want the payload as a String because I don't want each field
as a column at this step; I just want the payload exactly as it appears in the
source file.
Locally I'm using Spark 3.3 in a Jupyter Notebook with the Glue 4 image
(PySpark version: 3.3.0+amzn.1.dev0).
Here is my payload/source (test.txt):
{"payload":{"points":1220000000}}
{"payload":{"count":1550554545.0}}
{"payload":{"points":125888002540.0, "count":1550554545.0}}
{"payload":{"name": "Roger", "count":55154111.0}}
Here is my code:
from pyspark.sql.types import StringType, StructField, StructType
path = "/home/glue_user/workspace/jupyter_workspace/test/test.txt"
schema = StructType([StructField('payload', StringType(), True)])
my_df = spark.read.schema(schema).option("inferSchema", "false").json(path)
my_df.show(truncate=False)
Here is the result. PySpark renders the floating-point numbers in scientific
notation, even though I read the field as String.
+------------------------------------------------+
|payload                                         |
+------------------------------------------------+
|{"points":1220000000}                           |
|{"count":1.550554545E9}                         |
|{"points":1.2588800254E11,"count":1.550554545E9}|
|{"name":"Roger","count":5.5154111E7}            |
+------------------------------------------------+
Why can't I simply have my data as it is? Why is the value inside my string
field changed so that it ends up in scientific notation? For example, the first
line below is the source and the second is what I get:
{quote}"count":1550554545.0
"count":1.550554545E9
{quote}
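One way to picture the behaviour being asked for is the minimal sketch below, in plain Python. It assumes each input line is a single JSON object and that only a top-level key is needed. The helper name `raw_top_level_value` is hypothetical; it uses the standard `json` module only to find where the value starts and ends, then slices the original text, so the digits are never re-serialized.

```python
import json

_DECODER = json.JSONDecoder()
_WHITESPACE = " \t\r\n"


def _skip_ws(text, pos):
    # Advance past JSON whitespace; raw_decode does not skip it for us.
    while pos < len(text) and text[pos] in _WHITESPACE:
        pos += 1
    return pos


def raw_top_level_value(line, key="payload"):
    """Return the exact source text of a top-level key's value, or None.

    Hypothetical helper: the parsed value is discarded and only the
    end offset reported by raw_decode is used to slice the input.
    """
    try:
        pos = _skip_ws(line, 0)
        if pos >= len(line) or line[pos] != "{":
            return None
        pos = _skip_ws(line, pos + 1)
        while pos < len(line) and line[pos] == '"':
            name, pos = _DECODER.raw_decode(line, pos)   # member name
            pos = _skip_ws(line, pos)
            if pos >= len(line) or line[pos] != ":":
                return None
            start = _skip_ws(line, pos + 1)
            _, end = _DECODER.raw_decode(line, start)    # member value
            if name == key:
                return line[start:end]                   # untouched source text
            pos = _skip_ws(line, end)
            if pos < len(line) and line[pos] == ",":
                pos = _skip_ws(line, pos + 1)
    except ValueError:
        # json.JSONDecodeError is a ValueError: treat malformed lines as missing.
        return None
    return None


if __name__ == "__main__":
    print(raw_top_level_value('{"payload":{"count":1550554545.0}}'))
```

In Spark, a function like this could presumably be applied to the `value` column produced by `spark.read.text(path)` through a Python UDF. That combination is an assumption and has not been tested against the Glue 4 image.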
was:
The file has the data below:
DAta
1200404151072.121111111111
1200404151073
1200404151074.1232323
1200404151075.124344
1200404151076.12
1200404151077.123433333
1200404151078.12
1200404151079.12544545454554
1251080.1234444444444444444444
10000000000000000000000000000
Spark reads the values in scientific notation, but we want to read the data as
it appears in the file, with an accurate data type rather than a string data type.
+--------------------+
| DAta|
+--------------------+
|1.200404151072121E12|
| 1.200404151073E12|
|1.200404151074123...|
|1.200404151075124...|
| 1.20040415107612E12|
|1.200404151077123...|
| 1.20040415107812E12|
|1.200404151079125...|
| 1251080.1234444445|
| 1.0E28|
+--------------------+
> Spark reading data in scientific notation in String field
> ---------------------------------------------------------
>
> Key: SPARK-49616
> URL: https://issues.apache.org/jira/browse/SPARK-49616
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 3.1.0, 3.3.0
> Reporter: Daniel Pazeto Jr
> Priority: Major