Hi Alex,
- Your first assertion is correct. Regardless of Spark, and going all the
way back to Jurassic Park-era data processing, partition pruning and column
pruning are all-or-nothing. For a given query, every partition and every
column is either read or not read; there is no selective reading of values
driven by the query's specific filtering and aggregation requirements.

- With "late materialization", on the other hand, you read only the values
required for filtering and aggregation at query processing time. Whatever
else the query requests, but does not need for filtering and aggregation,
is read or computed later.

- I suppose your mileage varies because, if the underlying data changes
mid-query, you run into the kind of consistency problems that traditional
databases handle through isolation levels. Spark Structured Streaming may
introduce such challenges as well. So any traditional DML-type change, as
opposed to a read-only (DQL-type) query, may affect this. I am not sure
whether Spark can enforce this itself or whether it simply relies on the
default isolation level of the underlying data source. The brute-force
approach is to let readers run concurrently with other readers but block
writers for the duration of query processing.

HTH. For the avoidance of doubt, I am taking "late materialisation" to mean
deferring the reading or computation of values until they are actually
needed for filtering and aggregation, rather than reading all values up
front. There is a rough sketch of the idea at the bottom of this mail,
below the quoted thread.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Wed, 31 May 2023 at 19:14, Alex Cruise <a...@cluonflux.com> wrote:

> Just to clarify briefly, in hopes that future searchers will find this
> thread... ;)
>
> IIUC at the moment, partition pruning and column pruning are
> all-or-nothing: every partition and every column either is, or is not,
> used for a query.
>
> Late materialization would mean that only the values needed for filtering
> & aggregation would be read in the scan+filter stages, and any
> expressions requested by the user but not needed for filtering and
> aggregation would only be read/computed afterward.
>
> I can see how this will invite sequential consistency problems, in data
> sources where mutations like DML or compactions are happening behind the
> query's back, but presumably Spark users already have this class of
> problem, it's just less serious when the end-to-end execution time of a
> query is shorter.
>
> WDYT?
>
> -0xe1a
>
> On Wed, May 31, 2023 at 11:03 AM Alex Cruise <a...@cluonflux.com> wrote:
>
>> Hey folks, I'm building a Spark connector for my company's proprietary
>> data lake... That project is going fine despite the near total lack of
>> documentation. ;)
>>
>> In parallel, I'm also trying to figure out a better story for when
>> humans inevitably `select * from 100_trillion_rows`, glance at the first
>> page, then walk away forever. The traditional RDBMS approach seems to be
>> to keep a lot of state in server-side cursors, so they can eagerly fetch
>> only the first few pages of results and go to sleep until the user
>> advances the cursor, at which point we wake up and fetch a few more
>> pages.
>>
>> After some cursory googling about how Trino handles this nightmare
>> scenario, I found https://github.com/trinodb/trino/issues/49 and its
>> child https://github.com/trinodb/trino/pull/602, which appear to be
>> based on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf,
>> which is what HyPerDB (never open source, acquired by Tableau) was
>> based on.
>>
>> IIUC this kind of optimization isn't really feasible in Spark at
>> present, due to the sharp distinction between transforms, which are
>> always lazy, and actions, which are always eager. However, given the
>> very desirable performance/efficiency benefits, I think it's worth
>> starting this conversation: if we wanted to do something like this,
>> where would we start?
>>
>> Thanks!
>>
>> -0xe1a
>>
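
P.S. To make the late-materialization idea above concrete, here is a
minimal sketch of how it can be approximated by hand with today's
DataFrame API. Everything in it is illustrative rather than real: it
assumes a Parquet table at /data/events with a unique key column id, a
timestamp column ts used for filtering, and wide payload columns body and
attributes that the query selects but never filters on.

    // A hand-rolled approximation of late materialization in Spark.
    // Assumed (hypothetical) schema: id (unique key), ts (filter column),
    // body and attributes (wide payload columns).
    import org.apache.spark.sql.SparkSession

    object LateMaterializationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("late-materialization-sketch")
          .getOrCreate()

        // Pass 1: scan only what filtering needs. Column pruning keeps
        // this read narrow -- just id and ts, not the wide columns.
        val keys = spark.read.parquet("/data/events")
          .select("id", "ts")
          .where("ts >= '2023-05-01'")

        // Pass 2: "materialize" the remaining columns only for the rows
        // that survived the filter, via a key join. A native late-
        // materialization operator would do this inside the scan itself
        // rather than as a separate read plus join.
        val payload = spark.read.parquet("/data/events")
          .select("id", "body", "attributes")

        val result = keys.join(payload, "id")
        result.show()

        spark.stop()
      }
    }

The second read plus the key join is exactly the work a native operator
would push down into the scan; doing it by hand at least keeps the wide
columns out of the first pass. On the cursor question, df.toLocalIterator()
is probably the closest built-in analogue Spark has to a server-side
cursor: it computes and fetches results roughly one partition at a time, so
a query abandoned after the first page never pulls the remaining partitions
back to the driver.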