zero323 edited a comment on pull request #35296:
URL: https://github.com/apache/spark/pull/35296#issuecomment-1025158713
> @zero323 This PR won't have any impact on your first example because there is at least one value that is not None. In your second example it will be without this PR.
It seems like there is still some confusion regarding the actual impact of the changes proposed here.
So let's start by establishing a simple fact ‒ Spark uses the *row*-oriented JSON Lines format when writing JSON (that includes `to_json`).
The `ignoreNullFields` option, set to `true` by default, controls writer behavior when missing values are encountered:
- If it is set to `true`, the field is omitted from the output for a *given row*.
```python
>>> import tempfile
>>> import pyspark.pandas as ps
>>> path_true_some_null = tempfile.mktemp()
>>> df_some_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]})
>>> df_some_null.to_json(path_true_some_null, ignoreNullFields=True)
>>> for x in spark.read.text(path_true_some_null).collect():
... print(x.value)
{"id":2,"val":1.0}
{"id":1}
```
- If it is set to `false`, the field is emitted with a JSON `null` (untyped) value.
```python
>>> path_false_some_null = tempfile.mktemp()
>>> df_some_null.to_json(path_false_some_null, ignoreNullFields=False)
>>> for x in spark.read.text(path_false_some_null).collect():
... print(x.value)
{"id":1,"val":null}
{"id":2,"val":1.0}
```
The decision is made on a row-by-row basis and is not affected by the presence of missing values for the same field in any other row.
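The row-level decision can be sketched in plain Python. This is only an illustration of the writer's semantics, not Spark code; `to_json_lines` is a hypothetical helper:

```python
import json

def to_json_lines(rows, ignore_null_fields=True):
    """Mimic Spark's row-oriented JSON Lines writer (illustration only).

    Each row is serialized independently; with ignore_null_fields=True,
    None-valued fields are dropped from that row and that row alone.
    """
    lines = []
    for row in rows:
        if ignore_null_fields:
            row = {k: v for k, v in row.items() if v is not None}
        lines.append(json.dumps(row, separators=(",", ":")))
    return lines

rows = [{"id": 1, "val": None}, {"id": 2, "val": 1.0}]

# First row drops "val", second row keeps it ‒ decided per row:
print(to_json_lines(rows, ignore_null_fields=True))
# With explicit nulls, both rows carry the field:
print(to_json_lines(rows, ignore_null_fields=False))
```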
Your PR changes the default behavior of the `pyspark.pandas` JSON writer from the former (missing values are omitted) to the latter (missing values are preserved). This impacts the output as long as there is a missing value in any field of any row. Furthermore, it can have a significant impact (proportional to N * M, for N rows and M columns) on IO and storage costs in the case of sparse datasets.
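A back-of-envelope sketch of that storage cost, again in plain Python rather than Spark. The row count, column count, and ~10% density below are made-up numbers for illustration:

```python
import json
import random

random.seed(0)
# A sparse dataset: N rows, M columns, roughly 90% missing values.
n_rows, n_cols = 1000, 20
rows = [
    {f"c{j}": (1.0 if random.random() < 0.1 else None) for j in range(n_cols)}
    for _ in range(n_rows)
]

def serialized_size(rows, ignore_null):
    """Total bytes of the JSON Lines output (one line per row)."""
    total = 0
    for row in rows:
        if ignore_null:
            row = {k: v for k, v in row.items() if v is not None}
        total += len(json.dumps(row, separators=(",", ":"))) + 1  # +1 newline
    return total

with_nulls = serialized_size(rows, ignore_null=False)
without_nulls = serialized_size(rows, ignore_null=True)
# Each omitted field saves its '"cJ":null' bytes, so the gap grows with N * M.
print(with_nulls, without_nulls)
```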
Finally, counting columns in the re-read dataset is simply misleading. Your PR doesn't and cannot affect *reader* behavior (it just affects what the reader "sees") and, as mentioned before, explicitly writing missing values cannot resolve the problem of properly restoring the original schema. Consider the following:
```python
>>> import numpy as np
>>> df_all_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})
```
If you write it with `ignoreNullFields` set to `true`, the missing values are omitted, just as in the case where some non-missing values are present.
```python
>>> path_true_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_true_all_null, ignoreNullFields=True)
>>> for x in spark.read.text(path_true_all_null).collect():
... print(x.value)
{"id":2}
{"id":1}
```
but there is no trace that the other column was ever there.
```python
>>> ps.read_json(path_true_all_null)
id
0 2
```
Setting `ignoreNullFields` to `false`
```python
>>> path_false_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_false_all_null, ignoreNullFields=False)
>>> for x in spark.read.text(path_false_all_null).collect():
... print(x.value)
{"id":1,"val":null}
{"id":2,"val":null}
```
writes JSON `null`s explicitly (same as in the case of `df_some_null`) and, by doing so, provides enough information for the reader to infer that *some* column should be present.
```python
>>> df_all_null_read_with_false = ps.read_json(path_false_all_null)
id val
0 1 None
1 2 None
```
So, does it restore the original schema?
It doesn't. The original contained double values:
```python
>>> df_all_null.to_spark().printSchema()
root
|-- id: long (nullable = false)
|-- val: double (nullable = true)
```
and the restored one cannot make any assumptions about the types, so it defaults to strings:
```python
>>> df_all_null_read_with_false.to_spark().printSchema()
root
|-- id: long (nullable = true)
|-- val: string (nullable = true)
```
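The reason is visible even without Spark: an all-`null` column carries no type information whatsoever. A minimal illustration with the standard library:

```python
import json

lines = ['{"id":1,"val":null}', '{"id":2,"val":null}']
records = [json.loads(line) for line in lines]

# The reader can see that the column exists...
print(sorted(records[0].keys()))    # ['id', 'val']
# ...but every observed value is None, so nothing constrains its type;
# this is the situation where Spark's inference falls back to string.
print({r["val"] for r in records})  # {None}
```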
If the user requires a specific schema for the frame, the schema should be provided on read, which gives the same result regardless of whether specific fields are present in the output:
```python
>>> ps.read_json(path_false_all_null, schema="id long, val double").to_spark().printSchema()
root
|-- id: long (nullable = true)
|-- val: double (nullable = true)
```
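The effect of schema-on-read can be mimicked in plain Python to show why it is insensitive to omitted fields. `read_with_schema` and `SCHEMA` below are hypothetical helpers, not a Spark API:

```python
import json

# Declared types stand in for "id long, val double".
SCHEMA = {"id": int, "val": float}

def read_with_schema(lines, schema):
    """Parse JSON Lines, emitting every declared field with its declared type.

    A field that is missing from a row and a field written as null both
    come out as None ‒ the declared schema, not the data, fixes the shape.
    """
    out = []
    for line in lines:
        rec = json.loads(line)
        out.append({k: (t(rec[k]) if rec.get(k) is not None else None)
                    for k, t in schema.items()})
    return out

# Same result whether nulls were written explicitly or omitted:
with_nulls = ['{"id":1,"val":null}', '{"id":2,"val":null}']
without = ['{"id":1}', '{"id":2}']
print(read_with_schema(with_nulls, SCHEMA) == read_with_schema(without, SCHEMA))  # True
```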
This is the standard approach in production pipelines and, on top of consistency, provides significant performance improvements on realistically sized data.
So to reiterate:
- This PR affects the output as long as the dataset contains at least one missing value, and can have a significant impact on performance and storage costs.
- The observed impact on reader behavior is incidental and doesn't guarantee a correct schema for the loaded data.
What this PR really achieves is making the output of `to_json(some_path)`
```python
>>> path_pr_all_null = tempfile.mktemp()
>>> df_all_null.to_json(path_pr_all_null)
>>> for x in spark.read.text(path_pr_all_null).collect():
... print(x.value)
{"id":2,"val":null}
{"id":1,"val":null}
```
*roughly* equivalent to the output of `to_json()`
```python
>>> df_all_null.to_json()
'[{"id":1,"val":null},{"id":2,"val":null}]'
```
and pandas equivalents.
I wouldn't bother pointing all of this out, but I have enough experience with users making incorrect assumptions about Spark's behavior and the nature of fixes based on misleading JIRA tickets.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]