JigaoLuo commented on issue #16374: URL: https://github.com/apache/datafusion/issues/16374#issuecomment-3039853245
> > Hi [@zhuqi-lucas](https://github.com/zhuqi-lucas),
> >
> > While proofreading the blog, I had one major general question: **What are the limitations of such an embedded index?**
> >
> > * Is it limited to just one embedded index per file?
>
> No -- you could put as many indexes as you want (of course each new index will consume space in the file and add something to the metadata)
>
> > * Is it only possible to have a file-level index? (From the example, it seems like the hashset index is only applied at the file level.)
>
> No, it is possible to have indexes with whatever granularity you want (
>
> > I imagine other blog readers might have similar questions about the limitations, or the potential, of this embedded_index approach.
>
> Yes it is a good point -- we should make sure to point this out on the blog
>
> > If there are no strict limitations, then my follow-up discussion is: Could we potentially **supercharge** Parquet with techniques inspired by proprietary file formats? For example:
> >
> > * A true HyperLogLog
> > * Small materialized aggregates (like precomputed sums at the column chunk or data page level) [For example with Clickbench Q3: a global AVG just needs the metadata, once the precomputed sum and the total rowcount are there.]
> > * Even histograms or hashsets at the row group level (which would be a much more powerful version of min-max indexing for pruning)
>
> Absolutely! Maybe just for fun someone could cook up special indexes for each clickbench query and put it in the files -- and we could show some truly crazy speed
>
> The point would not be that any of those indexes is actually general purpose, but that parquet lets you put whatever you like in it

Thanks! This gave me the impression of a kind of **User-Defined Index**. I can now imagine that users could embed arbitrary binary data into this section of Parquet.
As long as the Parquet reader knows how to interpret that binary using a corresponding **User-Defined Index Function**, it could enable powerful capabilities, such as pruning & precomputed results for query processing, or even query optimization.
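
To make the idea a bit more concrete, here is a minimal Python sketch of what I have in mind. It is not real Parquet or DataFusion code: the `embedded` dict is a toy stand-in for the index section of one file, and every name (`INDEX_DECODERS`, `index_function`, `can_skip_file`, `average_from_index`) and both byte encodings are assumptions made up for illustration. It only shows the pattern of a reader looking up a decoder by index name and then using the decoded result for pruning or for answering an aggregate without scanning data.

```python
import json
import struct

# Hypothetical registry: index name -> function that decodes the raw bytes.
# Nothing here is a Parquet or DataFusion API; it only illustrates the idea.
INDEX_DECODERS = {}


def index_function(name):
    """Register a decoder (a "user-defined index function") under a name."""
    def register(fn):
        INDEX_DECODERS[name] = fn
        return fn
    return register


@index_function("distinct_values")
def decode_distinct(blob: bytes) -> set:
    # Assumed encoding: a JSON array of the distinct values in a column.
    return set(json.loads(blob.decode("utf-8")))


@index_function("sum_count")
def decode_sum_count(blob: bytes) -> tuple:
    # Assumed encoding: little-endian float64 sum, then uint64 row count.
    return struct.unpack("<dQ", blob)


def can_skip_file(indexes: dict, wanted) -> bool:
    """Pruning: skip the file if the distinct-value index rules out a match."""
    blob = indexes.get("distinct_values")
    if blob is None:
        return False  # no such index embedded, so the file must be scanned
    return wanted not in INDEX_DECODERS["distinct_values"](blob)


def average_from_index(indexes: dict) -> float:
    """Precomputed result: AVG from the stored sum and row count alone."""
    total, rows = INDEX_DECODERS["sum_count"](indexes["sum_count"])
    return total / rows


# Toy stand-in for the embedded index section of one file.
embedded = {
    "distinct_values": json.dumps(["a", "b", "c"]).encode("utf-8"),
    "sum_count": struct.pack("<dQ", 60.0, 4),
}
```

With this toy file, `can_skip_file(embedded, "z")` lets the reader skip the file entirely, and `average_from_index(embedded)` answers a global AVG from the index alone. A reader that does not know an index name would simply ignore those bytes and fall back to a normal scan.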