Hi Alex,

   - Your first assertion is correct. This applies to Spark and indeed goes
   back to Jurassic Park era data processing: partition pruning and column
   pruning are all-or-nothing. For a given query, each partition and each
   column is either read in full or not at all. There is no selective
   reading of values driven by the query's specific filtering and
   aggregation requirements.
   - With "late materialization", on the other hand, only the values
   required for filtering and aggregation are read at query processing
   time. Anything else requested by the query that is not needed for
   filtering and aggregation is read or computed later, once the
   qualifying rows are known.
   - I suppose your mileage varies, because if the underlying data changes
   mid-query you run into the consistency problems that traditional
   databases handle through isolation levels. Spark Structured Streaming
   may introduce such challenges as well. So any traditional DML-type
   change, as opposed to read-only DQL, may affect this. I am not sure
   whether Spark can enforce this itself or must rely on the default
   isolation level of the underlying data source. The brute-force approach
   is to allow concurrent readers but block writers for the duration of
   query processing.
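To make the distinction in the first two points concrete, here is a minimal, hypothetical sketch in plain Python (no Spark; the table, column names, and predicate are invented for illustration). It contrasts early materialization, which assembles every requested column for every row before filtering, with late materialization, which scans only the filter column first and fetches the remaining columns just for the surviving rows.

```python
# Hypothetical columnar "table": each column stored as a separate list.
table = {
    "id":     [1, 2, 3, 4, 5],
    "amount": [10, 250, 40, 300, 70],
    "note":   ["a", "b", "c", "d", "e"],  # stands in for a wide, expensive column
}

def early_materialization(table, predicate):
    # Read ALL requested columns for ALL rows up front, then filter.
    rows = [dict(zip(table, vals)) for vals in zip(*table.values())]
    return [r for r in rows if predicate(r["amount"])]

def late_materialization(table, predicate):
    # Step 1: scan only the column needed for filtering.
    keep = [i for i, v in enumerate(table["amount"]) if predicate(v)]
    # Step 2: materialize the remaining columns only for qualifying rows.
    return [{col: table[col][i] for col in table} for i in keep]

pred = lambda amount: amount > 100

# Both strategies produce the same result (the rows with amount 250 and
# 300), but late materialization never touched the "id" or "note" values
# of the filtered-out rows.
assert early_materialization(table, pred) == late_materialization(table, pred)
```

This is only a sketch of the access pattern; a real engine would apply it per column chunk with encoded/compressed data rather than Python lists.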

HTH

I assume that "late materialisation" refers to the idea of deferring the
reading or computation of values until they are actually needed for
filtering and aggregation, rather than reading all values upfront.

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 31 May 2023 at 19:14, Alex Cruise <a...@cluonflux.com> wrote:

> Just to clarify briefly, in hopes that future searchers will find this
> thread... ;)
>
> IIUC at the moment, partition pruning and column pruning are
> all-or-nothing: every partition and every column either is, or is not, used
> for a query.
>
> Late materialization would mean that only the values needed for filtering
> & aggregation would be read in the scan+filter stages, and any expressions
> requested by the user but not needed for filtering and aggregation would
> only be read/computed afterward.
>
> I can see how this will invite sequential consistency problems, in data
> sources where mutations like DML or compactions are happening behind the
> query's back, but presumably Spark users already have this class of
> problem, it's just less serious when the end-to-end execution time of a
> query is shorter.
>
> WDYT?
>
> -0xe1a
>
> On Wed, May 31, 2023 at 11:03 AM Alex Cruise <a...@cluonflux.com> wrote:
>
>> Hey folks, I'm building a Spark connector for my company's proprietary
>> data lake... That project is going fine despite the near total lack of
>> documentation. ;)
>>
>> In parallel, I'm also trying to figure out a better story for when humans
>> inevitably `select * from 100_trillion_rows`, glance at the first page,
>> then walk away forever. The traditional RDBMS approach seems to be to keep
>> a lot of state in server-side cursors, so they can eagerly fetch only the
>> first few pages of results and go to sleep until the user advances the
>> cursor, at which point we wake up and fetch a few more pages.
>>
>> After some cursory googling about how Trino handles this nightmare
>> scenario, I found https://github.com/trinodb/trino/issues/49 and its
>> child https://github.com/trinodb/trino/pull/602, which appear to be
>> based on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf,
>> which is what HyPerDB (never open source, acquired by Tableau) was based on.
>>
>> IIUC this kind of optimization isn't really feasible in Spark at present,
>> due to the sharp distinction between transforms, which are always lazy, and
>> actions, which are always eager. However, given the very desirable
>> performance/efficiency benefits, I think it's worth starting this
>> conversation: if we wanted to do something like this, where would we start?
>>
>> Thanks!
>>
>> -0xe1a
>>
>
