mistercrunch commented on issue #4588: [sqllab] How can we make large Superset
queries load faster?
So say a user runs a query that returns 1M rows. We know they're not going to read those 1M rows like it's a book, and 1K is usually a decent sample. So what are some legit things they might do that require the full 1M rows?
* exporting the CSV to excel (different button)
* searching client-side using the instant filtering table search
functionality, which isn't a super-legit use case since they could use a `LIKE`
clause or something equivalent (though `LIKE` assumes you know the column).
Anyhow, returning 1M rows to full-text search a table isn't great.
* infinite scrolling looking for patterns?
* sort a column and use pagination?
In any case, it may be a good idea to always return 1K rows and let them push a
button to fetch the whole set (and wait longer, maybe crash their browser).
Perhaps the data table shows a message at the top: `showing first 1k rows, click
here to load the entire data set`.
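This first-1K-plus-button flow could be driven by a little bit of state in the results payload. Here's a rough sketch; the field names and the `banner_message` helper are made up for illustration, not Superset's actual API:

```python
# Hypothetical shape of a partial-results payload; all field names are
# illustrative, not Superset's actual response format.
partial_results = {
    "status": "partially_saved",  # the new query state proposed below
    "rows_returned": 1000,
    "data": [],  # first 1K rows would go here
}

def banner_message(payload):
    """Banner shown at the top of the data table for partial results."""
    if payload["status"] == "partially_saved":
        return (
            f"showing first {payload['rows_returned']} rows, "
            "click here to load the entire data set"
        )
    return None
```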
I haven't looked at the implementation of `fetchmany` for Presto/Hive in
pyhive, but in theory, given the `dbapi` spec, you should be able to
`fetchmany(1000)` and then `fetchmany(MAX_ROWS)`. Assuming all this happens
on the worker (async), you'd ship the 1K rows to a first results backend
location, and the whole set to another one later on. That implies the introduction of:
* a new query state `partially_saved`
* perhaps some `db_engine_spec` methods, if the `two_phase_save` needs
engine-specific handling
* frontend logic that understands `partially_saved` and exposes it to the
user, allowing them to fetch the whole data set; CSV export should probably be
greyed out until the second phase is saved
* configuration elements to turn this on and off (per db? per db engine as a
`BaseEngineSpec` class attr?)
* two configuration settings, `PARTIAL_SAVE_ROW_COUNT` and
`MAX_SAVE_ROW_COUNT`, or something equivalent?
Oh, and I wanted to make it clear that we absolutely should not run the query
twice; that's just not right.