mistercrunch commented on issue #4588: [sqllab] How can we make large Superset 
queries load faster?
URL: 
https://github.com/apache/incubator-superset/issues/4588#issuecomment-372197841
 
 
   So say a user runs a query that returns 1M rows. We know they're not going 
to read those 1M rows like a book, and 1K is usually a decent sample. So what 
are some legit things they might do that require 1M rows?
   * exporting the CSV to Excel (different button)
   * searching client-side using the instant filtering table search 
functionality, which isn't a super-legit use-case since they could use a `LIKE` 
statement or something equivalent, though `LIKE` assumes you know the column. 
Anyhow, returning 1M rows just to full-text search a table isn't great.
   * infinite scrolling looking for patterns?
   * sort a column and use pagination?
   
   In any case, it may be a good thing to always fetch 1K rows first and let 
them push a button to get the whole set (and wait longer, maybe crash their 
browser). Perhaps the data table shows a message at the top: `showing first 1K 
rows, click here to load the entire data set`.
   
   I haven't looked at the implementation of `fetchmany` for Presto/Hive in 
pyhive, but in theory, given the `dbapi` spec, you should be able to 
`fetchmany(1000)` and then `fetchmany(MAX_ROWS)` on the same cursor. Assuming 
all this is happening on the worker (async), you'd ship the first 1K rows to 
one S3 location, and the whole set to another later on. That implies the 
introduction of:
   * a new query state `partially_saved`
   * perhaps some `db_engine_spec` methods if the `two_phase_save` needs some 
specific handling on different engines
   * frontend logic that understands `partially_saved`, and exposes that to the 
user, allowing them to fetch the whole data set. CSV export should probably be 
greyed out until the second phase is saved
   * configuration elements to turn this on and off (per db? per db engine as a 
BaseEngineSpec class attr?)
   * two environment configuration settings, `PARTIAL_SAVE_ROW_COUNT` and 
`MAX_SAVE_ROW_COUNT`, or something equivalent?
   
   Oh, and I wanted to make it clear that we absolutely should not run the 
query twice; that's just not right.
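   
   To make the two-phase idea concrete, here is a minimal sketch, not the 
actual Superset implementation. It uses the standard-library `sqlite3` driver 
as a stand-in for Presto/Hive, since both follow the `dbapi` spec. The function 
name `two_phase_save`, the `save` callback (standing in for the S3 writes) and 
the final state name `saved` are assumptions made for illustration; only 
`partially_saved` and the two setting names come from the list above. The 
query is executed once and both phases read from the same cursor.

```python
import sqlite3
from typing import Callable, List, Tuple

Rows = List[Tuple]

# Setting names come from the proposal above; the values are placeholders.
PARTIAL_SAVE_ROW_COUNT = 1000
MAX_SAVE_ROW_COUNT = 3000


def two_phase_save(cursor, sql: str, save: Callable[[str, Rows], None],
                   partial_count: int = PARTIAL_SAVE_ROW_COUNT,
                   max_count: int = MAX_SAVE_ROW_COUNT) -> Rows:
    """Run `sql` once; save a preview first, then the capped full result."""
    cursor.execute(sql)  # the only execution of the query

    # Phase 1: hand the first rows to `save` right away for the preview.
    preview = cursor.fetchmany(partial_count)
    save("partially_saved", preview)

    # Phase 2: keep reading from the same cursor, up to the overall cap.
    remaining = max_count - len(preview)
    rest = cursor.fetchmany(remaining) if remaining > 0 else []
    full = preview + rest
    save("saved", full)  # hypothetical name for the final state
    return full


# Demo against an in-memory SQLite table with 5000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5000)])

saved = {}
two_phase_save(conn.cursor(), "SELECT n FROM t ORDER BY n",
               lambda state, rows: saved.update({state: rows}))
print({state: len(rows) for state, rows in saved.items()})
```

   Whether pyhive's `fetchmany` streams rows this way for Presto/Hive is 
exactly the open question above, so the engine-specific behavior would still 
need checking.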

