frlm opened a new pull request, #30961:
URL: https://github.com/apache/superset/pull/30961

   **Title:** fix(csv_export): use custom CSV_EXPORT parameters in pd.read_csv
   
   ### Bug description
   
   Function: apply_post_process
   
   The issue is that `pd.read_csv` uses pandas' default parameters instead of those defined in `CSV_EXPORT` in `superset_config`. The problem is rarely noticeable with the default separator `,` and decimal `.`, but it becomes evident with a configuration such as `CSV_EXPORT = {"encoding": "utf-8", "sep": ";", "decimal": ","}`. This change ensures that `pd.read_csv` uses the parameters defined in `CSV_EXPORT`.
   
   **Steps to reproduce error**: 
   
   - Configure `CSV_EXPORT` with the following parameters:
      ```python
      CSV_EXPORT = {
          "encoding": "utf-8",
          "sep": ";",
          "decimal": ","
      }
      ```
   - Open a Pivot Table chart in Superset. In this example, we use Pivot Table v2 from the USA Births Names dashboard:
   
   
![image](https://github.com/user-attachments/assets/8389a6d0-91f2-455b-b29b-b2b928a09d2a)
   
   - Click on Download > **Export to Pivoted .CSV**
   
![image](https://github.com/user-attachments/assets/06f8e0d2-115e-4040-a129-3686d4e68c84)
   
   - Download is blocked by an error.
   
   
   
   **Cause**: The error is caused by an anomaly in the input DataFrame `df`, which has the following format: a single column containing all the fields joined by the semicolon separator:
   
   ```
   ,state;name;sum__num
   0,other;Michael;1047996
   1,other;Christopher;803607
   2,other;James;749686
   ```
   
   
   **Fix**: Pass the `CSV_EXPORT` settings from `superset_config` to `pd.read_csv` so the data is parsed with the configured separator, encoding, and decimal mark.
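   The mis-parse is easy to reproduce outside Superset. A minimal sketch (the sample data below is made up for illustration):

```python
import io

import pandas as pd

# Semicolon-separated payload, like the one produced with
# CSV_EXPORT = {"sep": ";", "decimal": ","}
data = "state;name;sum__num\nother;Michael;1047996\nother;Christopher;803607\n"

# Default parameters: pandas assumes sep="," and collapses each row
# into a single column, reproducing the anomaly described above.
broken = pd.read_csv(io.StringIO(data))
print(broken.shape[1])  # 1

# Passing the configured separator restores the real columns.
fixed = pd.read_csv(io.StringIO(data), sep=";")
print(list(fixed.columns))  # ['state', 'name', 'sum__num']
```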
   
   **Code Changes:**
   
   ~~~python
            elif query["result_format"] == ChartDataResultFormat.CSV:
                df = pd.read_csv(
                    StringIO(data),
                    delimiter=superset_config.CSV_EXPORT.get("sep"),
                    encoding=superset_config.CSV_EXPORT.get("encoding"),
                    decimal=superset_config.CSV_EXPORT.get("decimal"),
                )
   ~~~
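   One caveat: if a deployment's `CSV_EXPORT` omits one of these keys, `.get()` returns `None`, which is not the same as passing pandas' documented defaults. A slightly more defensive variant (a sketch, not part of this PR, using a stand-in `CSV_EXPORT` dict) supplies explicit fallbacks:

```python
import io

import pandas as pd

# Stand-in for superset_config.CSV_EXPORT; in a real deployment any of
# these keys may be missing.
CSV_EXPORT = {"encoding": "utf-8", "sep": ";", "decimal": ","}

data = "state;name;sum__num\nother;Michael;1047996\n"
df = pd.read_csv(
    io.StringIO(data),
    sep=CSV_EXPORT.get("sep", ","),            # pandas defaults as fallbacks
    encoding=CSV_EXPORT.get("encoding", "utf-8"),
    decimal=CSV_EXPORT.get("decimal", "."),
)
print(list(df.columns))  # ['state', 'name', 'sum__num']
```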
   
   
   **Complete Code**
   
   ~~~python
   
   def apply_post_process(
       result: dict[Any, Any],
       form_data: Optional[dict[str, Any]] = None,
       datasource: Optional[Union["BaseDatasource", "Query"]] = None,
   ) -> dict[Any, Any]:
       form_data = form_data or {}
   
       viz_type = form_data.get("viz_type")
       if viz_type not in post_processors:
           return result
   
       post_processor = post_processors[viz_type]
   
       for query in result["queries"]:
            if query["result_format"] not in (rf.value for rf in ChartDataResultFormat):
                raise Exception(  # pylint: disable=broad-exception-raised
                    f"Result format {query['result_format']} not supported"
                )
   
           data = query["data"]
   
           if isinstance(data, str):
               data = data.strip()
   
           if not data:
               # do not try to process empty data
               continue
   
           if query["result_format"] == ChartDataResultFormat.JSON:
               df = pd.DataFrame.from_dict(data)
            elif query["result_format"] == ChartDataResultFormat.CSV:
                df = pd.read_csv(
                    StringIO(data),
                    delimiter=superset_config.CSV_EXPORT.get("sep"),
                    encoding=superset_config.CSV_EXPORT.get("encoding"),
                    decimal=superset_config.CSV_EXPORT.get("decimal"),
                )
           # convert all columns to verbose (label) name
           if datasource:
               df.rename(columns=datasource.data["verbose_map"], inplace=True)
   
           processed_df = post_processor(df, form_data, datasource)
   
           query["colnames"] = list(processed_df.columns)
           query["indexnames"] = list(processed_df.index)
            query["coltypes"] = extract_dataframe_dtypes(processed_df, datasource)
           query["rowcount"] = len(processed_df.index)
   
           # Flatten hierarchical columns/index since they are represented as
           # `Tuple[str]`. Otherwise encoding to JSON later will fail because
           # maps cannot have tuples as their keys in JSON.
           processed_df.columns = [
               " ".join(str(name) for name in column).strip()
               if isinstance(column, tuple)
               else column
               for column in processed_df.columns
           ]
           processed_df.index = [
               " ".join(str(name) for name in index).strip()
               if isinstance(index, tuple)
               else index
               for index in processed_df.index
           ]
   
           if query["result_format"] == ChartDataResultFormat.JSON:
               query["data"] = processed_df.to_dict()
           elif query["result_format"] == ChartDataResultFormat.CSV:
               buf = StringIO()
               processed_df.to_csv(buf)
               buf.seek(0)
               query["data"] = buf.getvalue()
   
       return result
   
   ~~~
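   For completeness, the `decimal` parameter matters as much as the separator: with `decimal=","` pandas parses comma-marked numbers as floats instead of leaving them as strings. A small illustration (sample data made up):

```python
import io

import pandas as pd

# A payload written with CSV_EXPORT = {"sep": ";", "decimal": ","}
data = "name;value\nMichael;1047,5\n"

# Without decimal=",", the value stays a string (object dtype).
as_text = pd.read_csv(io.StringIO(data), sep=";")
print(as_text["value"].dtype)  # object

# With decimal=",", the column parses as a float.
as_float = pd.read_csv(io.StringIO(data), sep=";", decimal=",")
print(as_float.loc[0, "value"])  # 1047.5
```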
   
   
   
   

