mistercrunch commented on issue #4588: [sqllab] How can we make large Superset queries load faster?
URL: https://github.com/apache/incubator-superset/issues/4588#issuecomment-372197841

So say a user runs a query that returns 1M rows. We know they're not going to read those 1M rows like a book, and 1K is usually a decent sample. What are some legitimate things they might do that require 1M rows?

* exporting the CSV to Excel (different button)
* searching client-side using the instant-filtering table search functionality, which isn't a super-legit use case since they could use a `LIKE` statement or something equivalent, though `LIKE` assumes you know the column. Anyhow, returning 1M rows to full-text search a table isn't great.
* infinite scrolling looking for patterns?
* sorting a column and using pagination?

In any case, it may be a good thing to always fetch 1K and allow them to push a button to get the whole set (and wait longer, maybe crash their browser). Perhaps the data table shows a message at the top: `showing first 1k rows, click here to load the entire data set`.

I haven't looked at the implementation of `fetchmany` for Presto/Hive in PyHive, but in theory, given the `dbapi` spec, you should be able to call `fetchmany(1000)` and then `fetchmany(MAX_ROWS)`. Assuming all this is happening on the worker (async), you'd ship the 1K to a first S3 location and the whole set to another one later on.

That implies the introduction of:

* a new query state, `partially_saved`
* perhaps some `db_engine_spec` methods, if the `two_phase_save` needs some specific handling on different engines
* frontend logic that understands `partially_saved` and exposes that to the user, allowing them to fetch the whole data set; CSV export should probably be greyed out until the second phase is saved
* configuration elements to turn this on and off (per db? per db engine, as a `BaseEngineSpec` class attribute?)
* two new environment configuration settings, `PARTIAL_SAVE_ROW_COUNT` and `MAX_SAVE_ROW_COUNT`, or something equivalent

Oh, and I wanted to make it clear that we absolutely should not run the query twice; that's just not right.