Re: [I] Add an example of embedding indexes inside a parquet file [datafusion]

via GitHub Sat, 05 Jul 2025 13:19:05 -0700


JigaoLuo commented on issue #16374:
URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039853245


   > > Hi [@zhuqi-lucas](https://github.com/zhuqi-lucas),
   > > While proofreading the blog, I had one major general question: **What 
are the limitations of such an embedded index?**
   > > 
   > > * Is it limited to just one embedded index per file?
   > 
   > No -- you could put as many indexes as you want (of course each new index 
will consume space in the file and add something to the metadata).
   > 
   > > * Is it only possible to have a file-level index? (From the example, it 
seems like the hashset index is only applied at the file level.)
   > 
   > No, it is possible to have indexes with whatever granularity you want (
   > 
   > > I imagine other blog readers might have similar questions about the 
limitations—or the potential—of this embedded_index approach.
   > 
   > Yes it is a good point -- we should make sure to point this out on the blog
   > 
   > > If there are no strict limitations, then my follow-up discussion is: 
Could we potentially **supercharge** Parquet with techniques inspired by 
proprietary file formats? For example:
   > > * A true HyperLogLog
   > > * Small materialized aggregates (like precomputed sums at the column 
chunk or data page level) [For example with Clickbench Q3: a global AVG just 
needs the metadata, once the precomputed sum and the total rowcount are there.]
   > > * Even histograms or hashsets at the row group level (which would be a 
much more powerful version of min-max indexing for pruning)
   > 
   > Absolutely! Maybe just for fun someone could cook up special indexes for 
each clickbench query and put them in the files -- and we could show some truly 
crazy speed
   > 
   > The point would not be that any of those indexes is actually general 
purpose, but that parquet lets you put whatever you like in it
   
   Thanks! This gave me the impression of a kind of **User-Defined Index**. I 
can now imagine that users could embed arbitrary binary data into this section 
of Parquet. As long as the Parquet reader knows how to interpret that binary 
using a corresponding **User-Defined Index Function**, it could enable powerful 
capabilities, such as pruning & precomputed results for query processing, or 
even query optimization.
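
A minimal sketch of that idea, using only the Rust standard library. Everything named here (`UserDefinedIndex`, `DistinctValues`, `SumCount`, and the byte layouts) is hypothetical and chosen for illustration; it is not the DataFusion example's API or any Parquet-defined format. It only shows the reader-side contract: an opaque blob goes in, and a type that knows the encoding turns it back into something usable for pruning or for answering an aggregate from metadata alone.

```rust
use std::collections::HashSet;

/// Hypothetical interface: how a reader turns an opaque embedded blob back
/// into something the query engine can use.
trait UserDefinedIndex: Sized {
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Option<Self>;
}

/// Pruning index: the distinct values of one string column.
struct DistinctValues(HashSet<String>);

impl DistinctValues {
    /// `false` means the file (or row group) can be skipped for `col = value`.
    fn might_contain(&self, value: &str) -> bool {
        self.0.contains(value)
    }
}

impl UserDefinedIndex for DistinctValues {
    // Toy encoding (an assumption): little-endian u32 length, then UTF-8 bytes.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        for v in &self.0 {
            out.extend_from_slice(&(v.len() as u32).to_le_bytes());
            out.extend_from_slice(v.as_bytes());
        }
        out
    }

    fn from_bytes(mut bytes: &[u8]) -> Option<Self> {
        let mut values = HashSet::new();
        while !bytes.is_empty() {
            if bytes.len() < 4 {
                return None; // truncated length prefix
            }
            let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
            let rest = &bytes[4..];
            if rest.len() < len {
                return None; // truncated value
            }
            values.insert(String::from_utf8(rest[..len].to_vec()).ok()?);
            bytes = &rest[len..];
        }
        Some(DistinctValues(values))
    }
}

/// Precomputed aggregate: enough to answer a global AVG without data pages.
struct SumCount {
    sum: i64,
    rows: u64,
}

impl SumCount {
    fn avg(&self) -> Option<f64> {
        if self.rows == 0 {
            None
        } else {
            Some(self.sum as f64 / self.rows as f64)
        }
    }
}

impl UserDefinedIndex for SumCount {
    // Toy encoding (an assumption): 8 bytes of sum followed by 8 bytes of row count.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = self.sum.to_le_bytes().to_vec();
        out.extend_from_slice(&self.rows.to_le_bytes());
        out
    }

    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != 16 {
            return None;
        }
        let mut sum = [0u8; 8];
        let mut rows = [0u8; 8];
        sum.copy_from_slice(&bytes[..8]);
        rows.copy_from_slice(&bytes[8..]);
        Some(SumCount { sum: i64::from_le_bytes(sum), rows: u64::from_le_bytes(rows) })
    }
}
```

Where the bytes live in the file and how the reader locates them is left out on purpose; the sketch only covers the "reader knows how to interpret the binary" half of the idea.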


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


