alamb commented on issue #14816: URL: https://github.com/apache/datafusion/issues/14816#issuecomment-2676188552
> We are using DataFusion to query Parquet files and wondering if the result of the query can be represented as a bit set of the document position (example below). Bit sets from the different engines can be intersected to identify the documents which meet the criteria. The resulting bit set then can be used to fetch the relevant documents from Parquet.

I think there are two parts to your question:

1. Representing the results as a bitset: I think you would have to implement a custom "pivot" type operation that took row ids somehow and created a bitset from them.
2. Fetching only relevant documents from parquet: the current reader is set up to efficiently fetch large contiguous blocks of values ([`RowSelection`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html)).

@XiangpengHao has been thinking about a bitset representation for selected rows recently, so perhaps you can help contribute to making that happen in the parquet reader.
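To make the two parts concrete, here is a minimal std-only sketch, not DataFusion or parquet code. It assumes you already have row ids (document positions) from each engine. It packs them into bitsets, intersects the bitsets, and then collapses the result into alternating skip/select runs, which is the shape a contiguous-block reader wants. The `Run` enum and all function names are hypothetical and only mirror the idea behind `RowSelection`; as far as I know the real crate builds a selection from a `Vec<RowSelector>` made with `RowSelector::skip(n)` and `RowSelector::select(n)`, but check the docs before relying on that.

```rust
/// Hypothetical run type for illustration; this is NOT the parquet crate's `RowSelector`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Pack row ids into a bitset covering `num_rows` rows (bit i set => row i matched).
fn bitset_from_row_ids(row_ids: &[usize], num_rows: usize) -> Vec<u64> {
    let mut bits = vec![0u64; (num_rows + 63) / 64];
    for &id in row_ids {
        assert!(id < num_rows, "row id out of range");
        bits[id / 64] |= 1u64 << (id % 64);
    }
    bits
}

/// Intersect two bitsets that cover the same number of rows.
fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    assert_eq!(a.len(), b.len(), "bitsets must have the same length");
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}

/// Return true if row `i` is set in the bitset.
fn is_set(bits: &[u64], i: usize) -> bool {
    ((bits[i / 64] >> (i % 64)) & 1) == 1
}

/// Collapse a bitset into alternating skip/select runs over rows `0..num_rows`.
fn runs_from_bitset(bits: &[u64], num_rows: usize) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < num_rows {
        let selected = is_set(bits, i);
        let start = i;
        // Extend the run while the bit value stays the same.
        while i < num_rows && is_set(bits, i) == selected {
            i += 1;
        }
        let len = i - start;
        runs.push(if selected { Run::Select(len) } else { Run::Skip(len) });
    }
    runs
}
```

For example, with 10 rows, engine A matching rows 1, 2, 3, 7 and engine B matching rows 2, 3, 4, 7, the intersection keeps rows 2, 3 and 7, and `runs_from_bitset` yields skip 2, select 2, skip 3, select 1, skip 2. Note that a sparse bitset turns into many short runs, which is exactly where a run-based selection gets less efficient than a native bitset representation in the reader.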