zero323 commented on pull request #35296:
URL: https://github.com/apache/spark/pull/35296#issuecomment-1025158713


   > @zero323 This PR won't have any impact on your first example because there 
is at least one value that is not None.  In your second example it will be 
without this PR.
   
   It seems like there is still some confusion regarding the actual impact of the changes proposed here.
   
   So let's start by establishing a simple fact ‒ Spark uses the *row*-oriented JSON Lines format whenever JSON is written (that includes `to_json`). The `ignoreNullFields` option, set to `true` by default, controls writer behavior when missing values are encountered:
   
   - If it is set to `true`, then the field is omitted from the output for a *given row*.
   
     ```python
     >>> import tempfile
     >>> import pyspark.pandas as ps
     >>> path_true_some_null = tempfile.mktemp()
     >>> df_some_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [None, 1]})
     >>> df_some_null.to_json(path_true_some_null, ignoreNullFields=True)
     >>> for x in spark.read.text(path_true_some_null).collect():
     ...     print(x.value)
     {"id":2,"val":1.0}
     {"id":1}
     ```
   
   - If it is set to `false`, then the field is emitted with a JSON `null` (untyped) value.
   
     ```python
     >>> path_false_some_null = tempfile.mktemp()
     >>> df_some_null.to_json(path_false_some_null, ignoreNullFields=False)
     >>> for x in spark.read.text(path_false_some_null).collect():
     ...     print(x.value)
     {"id":1,"val":null}
     {"id":2,"val":1.0}
     ```
   
   The decision is made on a row-by-row basis and is not affected by the presence of missing values for the same field in any other row.
   
   Your PR changes the default behavior of the `pyspark.pandas` JSON writer from the former (missing values are omitted) to the latter (missing values are preserved). This impacts the output as long as there is any missing value in any field of any row. Furthermore, it can have a significant impact (proportional to N * M, for N rows and M columns) in the case of sparse datasets (IO cost, storage cost).
   
   Finally, counting columns in the re-read dataset is simply misleading ‒ your PR doesn't and cannot affect *reader* behavior (it only affects what the reader "sees") and, as mentioned before, explicitly writing missing values cannot resolve the problem of properly restoring the original schema. Consider the following:
   
   ```python
   >>> import numpy as np
   >>> df_all_null = ps.DataFrame.from_dict({"id": [1, 2], "val": [np.nan, np.nan]})
   ```
   
   If you write it with `ignoreNullFields` set to `true`, it omits the missing values, same as in the case where some non-missing values are present.
   
   ```python
   >>> path_true_all_null = tempfile.mktemp()
   >>> df_all_null.to_json(path_true_all_null, ignoreNullFields=True)
   >>> for x in spark.read.text(path_true_all_null).collect():
   ...     print(x.value)
   {"id":2}
   {"id":1}
   ```
   
   but there is no trace that the other column was ever there.
   
   ```python
   >>> ps.read_json(path_true_all_null)
      id
   0   2
   ```
   
   Setting `ignoreNullFields` to `false`
   
   ```python
   >>> path_false_all_null = tempfile.mktemp()
   >>> df_all_null.to_json(path_false_all_null, ignoreNullFields=False)
   >>> for x in spark.read.text(path_false_all_null).collect():
   ...     print(x.value)
   {"id":1,"val":null}
   {"id":2,"val":null}
   ```
   
   writes JSON `null`s explicitly (same as in the case of `df_some_null`) and, by doing so, provides enough information for the reader to infer that *some* column should be present.
   
   ```python
   >>> df_all_null_read_with_false = ps.read_json(path_false_all_null)
   >>> df_all_null_read_with_false
      id   val
   0   1  None
   1   2  None
   ```
   
   So, does it restore the original schema? 
   
   It doesn't. The original contained double values:
   
   ```python
   >>> df_all_null.to_spark().printSchema()
   root
    |-- id: long (nullable = false)
    |-- val: double (nullable = true)
   ```
   
   and the restored one cannot make any assumptions about the types, so it defaults to strings:
   
   ```python
   >>> df_all_null_read_with_false.to_spark().printSchema()
   root
    |-- id: long (nullable = true)
    |-- val: string (nullable = true)
   ```
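   That is expected ‒ a JSON `null` is untyped, so it carries no information the reader could use to recover the original column type. A minimal check with the standard `json` module:
   
   ```python
   import json
   
   # A JSON null deserializes to None regardless of the writer-side type
   value = json.loads('{"val": null}')["val"]
   print(value is None)  # True ‒ the original double type is unrecoverable
   ```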
   
   If the user requires a specific schema for the frame, the schema should be provided on read, which gives the same result regardless of whether specific fields are present in the output or not:
   
   ```python
   >>> ps.read_json(path_false_all_null, schema="id long, val double").to_spark().printSchema()
   root
    |-- id: long (nullable = true)
    |-- val: double (nullable = true)
   ```
   
   This is the standard approach in production pipelines and, on top of consistency, provides significant performance improvements on realistically sized data.
   
   So, to reiterate:
   
   - This PR affects the output as long as the dataset contains at least one missing value, and can have a significant impact on performance and storage costs.
   - The observed impact on reader behavior is incidental and doesn't guarantee the correct schema for the loaded data.
   
   What this PR really achieves is making the output of `to_json(some_path)`
   
   ```python
   >>> path_pr_all_null = tempfile.mktemp()
   >>> df_all_null.to_json(path_pr_all_null)
   >>> for x in spark.read.text(path_pr_all_null).collect():
   ...     print(x.value)
   {"id":2,"val":null}
   {"id":1,"val":null}
   ```
   
   roughly equivalent to the output of `to_json()`
   
   ```python
   >>> df_all_null.to_json()
   '[{"id":1,"val":null},{"id":2,"val":null}]'
   ```
   
   and their pandas equivalents.
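   For reference, pandas itself writes explicit nulls in this situation. A quick check, assuming pandas and numpy are available:
   
   ```python
   import numpy as np
   import pandas as pd
   
   # pandas analogue of df_all_null
   pdf_all_null = pd.DataFrame({"id": [1, 2], "val": [np.nan, np.nan]})
   
   # pandas keeps the "val" field and serializes NaN as JSON null
   print(pdf_all_null.to_json(orient="records"))
   # [{"id":1,"val":null},{"id":2,"val":null}]
   ```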
   
   
   I wouldn't bother pointing all of this out, but I have enough experience with users making incorrect assumptions about Spark behavior and the nature of fixes, based on misleading JIRA tickets.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


