Hey folks, I'm building a Spark connector for my company's proprietary data lake... That project is going fine despite the near-total lack of documentation. ;)
In parallel, I'm also trying to figure out a better story for when humans inevitably run `select * from 100_trillion_rows`, glance at the first page, then walk away forever. The traditional RDBMS approach seems to be to keep a lot of state in server-side cursors: the server eagerly fetches only the first few pages of results and goes to sleep until the user advances the cursor, at which point it wakes up and fetches a few more pages.
After some cursory googling about how Trino handles this nightmare scenario, I found https://github.com/trinodb/trino/issues/49 and its child https://github.com/trinodb/trino/pull/602, which appear to be based on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which also underlies HyPer (never open source, acquired by Tableau).
IIUC this kind of optimization isn't really feasible in Spark at present, due to the sharp distinction between transformations, which are always lazy, and actions, which are always eager. However, given the very desirable performance/efficiency benefits, I think it's worth starting this conversation: if we wanted to do something like this, where would we start?
Thanks! -0xe1a
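P.S. To make the cursor behavior I'm describing concrete, here is a minimal plain-Python sketch (no Spark involved). `PagedCursor` and `scan_table` are hypothetical names I made up for illustration, not APIs from Spark, Trino, or any RDBMS; the point is only that the scan does no work beyond the pages the user actually asks for.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def scan_table(num_rows: int) -> Iterator[int]:
    """Hypothetical stand-in for a table scan; yields row ids lazily."""
    for row_id in range(num_rows):
        yield row_id


class PagedCursor:
    """Hypothetical server-side cursor that pulls one page at a time."""

    def __init__(self, rows: Iterable[int], page_size: int) -> None:
        self._rows: Iterator[int] = iter(rows)
        self._page_size = page_size
        self.rows_fetched = 0  # how much scan work has actually happened

    def next_page(self) -> List[int]:
        # Pull at most page_size rows; the scan stays suspended in between.
        page = list(islice(self._rows, self._page_size))
        self.rows_fetched += len(page)
        return page


# The "query" covers 100 trillion rows, but only one page is ever produced.
cursor = PagedCursor(scan_table(100_000_000_000_000), page_size=3)
first_page = cursor.next_page()
```

If the user walks away after `first_page`, the remaining rows are never produced; the cost is the state the cursor has to hold while it waits.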