YuriyKrasilnikov commented on PR #37516:
URL: https://github.com/apache/superset/pull/37516#issuecomment-4763022674
Thanks again for the rebase and for pointing out the conflict areas.
I re-reviewed the PR from current master through the rebased branch,
especially the `get_df_payload()` area where the BigQuery warning/memory-limit
handling and the `applied_filter_columns` cache-miss logic landed near the
timing code.
## What changed
I kept `CHART_DATA_INCLUDE_TIMING` as a config option with default `False`
for now.
I agree this means the API response does not include timing out of the box,
so it does not fully answer #37507 by default. But after looking at the
security/API-contract concern again, I think default-on timing should be an
explicit maintainer decision. The current behavior is:
- timing is still always collected internally
- `STATS_LOGGER` metrics and slow-query logging still work regardless of the
config
- API response timing stays opt-in via `CHART_DATA_INCLUDE_TIMING = True`
If the preferred direction is default `True`, or moving this behind a
feature flag instead of config, I can switch it.
## Root cause / fix
The main issue I found is that timing was being finalized too early in
`get_df_payload()`. That missed later response work like `_get_full()`
serialization, `query_context.get_data()`, and the `POST_PROCESSED` path
through `apply_client_processing()`.
The fix now carries timing as internal query state until command/response
finalization, then strips the internal key before returning the public
response. This makes `result_processing_ms` and `total_ms` include the
response-shaping work.
## Response consistency
I also fixed the `_get_full` inconsistency you called out:
- when response timing is disabled, `RESULTS` no longer returns `timing:
null`
- `FULL` and `RESULTS` both omit timing unless the config is enabled
- `db_execution_ms` is explicitly `null` on cache hits
Separately, CodeAnt pointed out a related `RESULTS` flow mismatch around
`is_cached`: `_get_full()` reduces the payload and does not carry top-level
`is_cached`, so the timing finalizer could no longer derive the cache state
from the reduced result. I fixed that by carrying `is_cached` inside the
internal timing state before reduction, and keeping public `timing.is_cached`
as a strict boolean.
Also addressed the older Bito notes by removing the no-op `db_execution_ms`
assignment and documenting response-only `timing.is_cached`.
One related issue I found while reviewing the conflict area: when an old
cached payload is invalidated because it has no `applied_filter_columns`, the
code re-runs the query but could still keep cache-hit metadata like
`is_cached=True` / `cached_dttm`. I cleared that metadata in the invalidation
path so the reloaded result is reported as a miss.
Added/updated tests for full/results responses, deferred finalization, cache
hits/misses, stale-cache invalidation, and the disabled timing response path.
Latest push is `945d7048a8`.
P.S. While reviewing the `POST_PROCESSED` response path, I noticed a
possible adjacent CSV issue that looks separate from this timing PR.
The path is:
- `QueryContextProcessor.get_data()` can return CSV data as `bytes` after
applying `CSV_EXPORT["encoding"]`
- `_send_chart_response()` calls `apply_client_processing()` before the
final response is built for `POST_PROCESSED`
- `apply_client_processing()` then reads CSV data with `StringIO(data)`,
which expects `str`, not `bytes`
<details>
<summary>Relevant code paths</summary>
```python
# superset/common/query_context_processor.py
if self._query_context.result_format == ChartDataResultFormat.CSV:
result = csv.df_to_escaped_csv(
df, index=include_index, **current_app.config["CSV_EXPORT"]
)
result = result.encode(
current_app.config["CSV_EXPORT"].get("encoding", "utf-8")
)
```
```python
# superset/charts/data/api.py
if result_type == ChartDataResultType.POST_PROCESSED:
result = apply_client_processing(result, form_data, datasource)
```
```python
# superset/charts/client_processing.py
elif query["result_format"] == ChartDataResultFormat.CSV:
df = pd.read_csv(
StringIO(data),
keep_default_na=na_values is None,
na_values=na_values,
)
```
</details>
I found #36682 / #38616 in the same area, but those seem focused on
`CSV_EXPORT` delimiter/decimal handling rather than the exact `bytes` vs
`StringIO` case. I did not change it here to keep this PR scoped to timing, but
it may be worth checking separately.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]