[jira] [Updated] (SPARK-49616) Spark reading data in scientific notation in String field

Daniel Pazeto Jr (Jira) Thu, 12 Sep 2024 06:56:05 -0700


     [ https://issues.apache.org/jira/browse/SPARK-49616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Daniel Pazeto Jr updated SPARK-49616:
-------------------------------------
    Description: 
When I read JSON payloads, PySpark changes the data even though I read the 
field as StringType. I want the payload as a String because I don't want each 
field as a column at this step; I just want the payload exactly as it appears 
in the source file.
Locally I'm using Spark 3.3 in a Jupyter Notebook with the Glue 4 image. 
PySpark version: 3.3.0+amzn.1.dev0

Here is my payload/source (test.txt):
{code:java}
{"payload":{"points":1220000000}}
{"payload":{"count":1550554545.0}}
{"payload":{"points":125888002540.0, "count":1550554545.0}}
{"payload":{"name": "Roger", "count":55154111.0}}{code}
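Here is my code:
{code:python}
from pyspark.sql.types import StructType, StructField, StringType

# `spark` is the SparkSession provided by the notebook.
path = "/home/glue_user/workspace/jupyter_workspace/test/test.txt"
# Read the whole "payload" object into a single string column.
schema = StructType([StructField('payload', StringType(), True)])
my_df = spark.read.schema(schema).option("inferSchema", "false").json(path)
my_df.show(truncate=False){code}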
Here is the result: PySpark renders the float numbers in scientific 
notation, even though I read the field as String.
{code:java}
+------------------------------------------------+
|payload                                         |
+------------------------------------------------+
|{"points":1220000000}                           |
|{"count":1.550554545E9}                         |
|{"points":1.2588800254E11,"count":1.550554545E9}|
|{"name":"Roger","count":5.5154111E7}            |
+------------------------------------------------+ {code}
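A minimal sketch, using only Python's standard json module and no Spark, to check whether the rows above differ from the source in value or only in text. The row strings are copied from the source file and from the output table; the helper name same_values is made up for this illustration.

```python
import json

# Payload strings copied from the source file and from the Spark output.
source_rows = [
    '{"count":1550554545.0}',
    '{"points":125888002540.0, "count":1550554545.0}',
    '{"name": "Roger", "count":55154111.0}',
]
spark_rows = [
    '{"count":1.550554545E9}',
    '{"points":1.2588800254E11,"count":1.550554545E9}',
    '{"name":"Roger","count":5.5154111E7}',
]


def same_values(a: str, b: str) -> bool:
    """Return True when two JSON texts parse to equal Python objects."""
    return json.loads(a) == json.loads(b)


# Compare each source row with the corresponding Spark row, by value and by text.
values_match = [same_values(s, o) for s, o in zip(source_rows, spark_rows)]
text_match = [s == o for s, o in zip(source_rows, spark_rows)]
print(values_match)
print(text_match)
```

If the first list is all True and the second all False, the numbers themselves are intact and only their textual representation was rewritten, which is what matters when the payload must be kept character for character.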
Why can't I simply get my data as it is? Why is the value changed inside my 
string field, so that I receive scientific notation? For example:
{quote}"count":1550554545.0
"count":1.550554545E9
{quote}
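One possible workaround, sketched here under assumptions rather than offered as a confirmed fix: skip the JSON reader for this column and slice the raw text of the payload value out of each line. The helper raw_value is hypothetical, uses only the Python standard library, and assumes the first occurrence of the key in a line is the top-level one, which holds for the sample rows.

```python
import json
import re
from typing import Optional

_DECODER = json.JSONDecoder()


def raw_value(line: str, key: str = "payload") -> Optional[str]:
    """Return the exact source text of the value stored under `key`.

    Assumption: the first occurrence of '"key":' in the line is the
    top-level key we want (true for the sample rows in this report).
    """
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), line)
    if match is None:
        return None
    start = match.end()
    # raw_decode parses one JSON value starting at `start` and reports
    # where it ends, so the original characters can be sliced out unchanged.
    _, end = _DECODER.raw_decode(line, start)
    return line[start:end]


print(raw_value('{"payload":{"count":1550554545.0}}'))
```

In PySpark this might be applied by reading the file with spark.read.text(path) and wrapping raw_value in a UDF over the resulting value column; that part is untested here.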


> Spark reading data in scientific notation in String field
> ---------------------------------------------------------
>
>                 Key: SPARK-49616
>                 URL: https://issues.apache.org/jira/browse/SPARK-49616
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 3.1.0, 3.3.0
>            Reporter: Daniel Pazeto Jr
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
