Just to clarify briefly, in hopes that future searchers will find this
thread... ;)

IIUC at the moment, partition pruning and column pruning are
all-or-nothing: every partition and every column either is, or is not, used
for a query.

Late materialization would mean that only the values needed for filtering &
aggregation would be read in the scan+filter stages, and any expressions
requested by the user but not needed for filtering and aggregation would
only be read/computed afterward.

I can see how this will invite sequential consistency problems, in data
sources where mutations like DML or compactions are happening behind the
query's back, but presumably Spark users already have this class of
problem, it's just less serious when the end-to-end execution time of a
query is shorter.

WDYT?

-0xe1a

On Wed, May 31, 2023 at 11:03 AM Alex Cruise <a...@cluonflux.com> wrote:

> Hey folks, I'm building a Spark connector for my company's proprietary
> data lake... That project is going fine despite the near total lack of
> documentation. ;)
>
> In parallel, I'm also trying to figure out a better story for when humans
> inevitably `select * from 100_trillion_rows`, glance at the first page,
> then walk away forever. The traditional RDBMS approach seems to be to keep
> a lot of state in server-side cursors, so they can eagerly fetch only the
> first few pages of results and go to sleep until the user advances the
> cursor, at which point we wake up and fetch a few more pages.
>
> After some cursory googling about how Trino handles this nightmare
> scenario, I found https://github.com/trinodb/trino/issues/49 and its
> child https://github.com/trinodb/trino/pull/602, which appear to be based
> on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is
> what HyPerDB (never open source, acquired by Tableau) was based on.
>
> IIUC this kind of optimization isn't really feasible in Spark at present,
> due to the sharp distinction between transforms, which are always lazy, and
> actions, which are always eager. However, given the very desirable
> performance/efficiency benefits, I think it's worth starting this
> conversation: if we wanted to do something like this, where would we start?
>
> Thanks!
>
> -0xe1a
>

Reply via email to