Just to clarify briefly, in hopes that future searchers will find this thread... ;)
IIUC at the moment, partition pruning and column pruning are all-or-nothing: every partition and every column either is, or is not, used for a query. Late materialization would mean that only the values needed for filtering & aggregation would be read in the scan+filter stages, and any expressions requested by the user but not needed for filtering and aggregation would only be read/computed afterward. I can see how this will invite sequential consistency problems, in data sources where mutations like DML or compactions are happening behind the query's back, but presumably Spark users already have this class of problem, it's just less serious when the end-to-end execution time of a query is shorter. WDYT? -0xe1a On Wed, May 31, 2023 at 11:03 AM Alex Cruise <a...@cluonflux.com> wrote: > Hey folks, I'm building a Spark connector for my company's proprietary > data lake... That project is going fine despite the near total lack of > documentation. ;) > > In parallel, I'm also trying to figure out a better story for when humans > inevitably `select * from 100_trillion_rows`, glance at the first page, > then walk away forever. The traditional RDBMS approach seems to be to keep > a lot of state in server-side cursors, so they can eagerly fetch only the > first few pages of results and go to sleep until the user advances the > cursor, at which point we wake up and fetch a few more pages. > > After some cursory googling about how Trino handles this nightmare > scenario, I found https://github.com/trinodb/trino/issues/49 and its > child https://github.com/trinodb/trino/pull/602, which appear to be based > on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is > what HyPerDB (never open source, acquired by Tableau) was based on. > > IIUC this kind of optimization isn't really feasible in Spark at present, > due to the sharp distinction between transforms, which are always lazy, and > actions, which are always eager. However, given the very desirable > performance/efficiency benefits, I think it's worth starting this > conversation: if we wanted to do something like this, where would we start? > > Thanks! > > -0xe1a >