Late materialization?

Alex Cruise Wed, 31 May 2023 11:04:22 -0700

Hey folks, I'm building a Spark connector for my company's proprietary data
lake... That project is going fine despite the near total lack of
documentation. ;)


In parallel, I'm also trying to figure out a better story for when humans
inevitably `select * from 100_trillion_rows`, glance at the first page,
then walk away forever. The traditional RDBMS approach seems to be to keep
a lot of state in server-side cursors, so they can eagerly fetch only the
first few pages of results and go to sleep until the user advances the
cursor, at which point we wake up and fetch a few more pages.

After some cursory googling about how Trino handles this nightmare
scenario, I found https://github.com/trinodb/trino/issues/49 and its child
https://github.com/trinodb/trino/pull/602, which appear to be based on the
paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is what
HyPerDB (never open source, acquired by Tableau) was based on.

IIUC this kind of optimization isn't really feasible in Spark at present,
due to the sharp distinction between transforms, which are always lazy, and
actions, which are always eager. However, given the very desirable
performance/efficiency benefits, I think it's worth starting this
conversation: if we wanted to do something like this, where would we start?

Thanks!

-0xe1a

Late materialization?

Reply via email to