[GitHub] [superset] john-bodley opened a new pull request, #20151: fix(csv): Ensure df_to_escaped_csv handles NULL

GitBox Fri, 20 May 2022 15:42:16 -0700


john-bodley opened a new pull request, #20151:
URL: https://github.com/apache/superset/pull/20151


   <!---
   Please write the PR title following the conventions at 
https://www.conventionalcommits.org/en/v1.0.0/
   Example:
   fix(dashboard): load charts correctly
   -->
   
   ### SUMMARY
   
   When exporting a SQL Lab result set with a integer column containing `NULL` 
values, Numpy/Pandas coerces them to floats during the 
`pd.DataFrame.applymap(...)` call given that `NaN` is actually a float, i.e., 
   
   ```
   >>> import pyarrow as pa
   >>>
   >>> df = pa.array([1, 2, 
None]).to_pandas(integer_object_nulls=True).to_frame()
   >>> df
         0
   0     1
   1     2
   2  None
   >>> df.applymap(lambda x: x)
        0
   0  1.0
   1  2.0
   2  NaN
   ```
   
   Type coercion in Pandas is overly magical and often undesirable. The fix is 
somewhat yuck as well, iterating over the columns and rows of the DataFrame and 
escaping only those cells which are actually `str`.
   
   I tried a number of more performant/vectorized solutions but at the end of 
the day they rely on `numpy.array`s which require declared types otherwise 
coercion is performed. Note the `numpy.dtype(object)`  data type is used to 
handled both strings an integers—as a byproduct of how data is exported from 
PyArrow—given how special handling is required for handling missing values with 
integers per 
[here](https://pandas.pydata.org/docs/user_guide/gotchas.html#support-for-integer-na).
 The TL;DR is this is all very unpleasant.
   
   ### BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
   <!--- Skip this if not applicable -->
   
   ### TESTING INSTRUCTIONS
   
   Added unit tests.
   
   ### ADDITIONAL INFORMATION
   <!--- Check any relevant boxes with "x" -->
   <!--- HINT: Include "Fixes #nnn" if you are fixing an existing issue -->
   - [ ] Has associated issue:
   - [ ] Required feature flags:
   - [ ] Changes UI
   - [ ] Includes DB Migration (follow approval process in 
[SIP-59](https://github.com/apache/superset/issues/13351))
     - [ ] Migration is atomic, supports rollback & is backwards-compatible
     - [ ] Confirm DB migration upgrade and downgrade tested
     - [ ] Runtime estimates and downtime expectations provided
   - [ ] Introduces new feature or API
   - [ ] Removes existing feature or API
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [superset] john-bodley opened a new pull request, #20151: fix(csv): Ensure df_to_escaped_csv handles NULL

Reply via email to