tustvold commented on issue #6454: URL: https://github.com/apache/arrow-rs/issues/6454#issuecomment-2402118020
I've taken a look at the [reproducer](https://github.com/samuelcolvin/batson-perf/blob/main/src/main.rs) linked from https://github.com/apache/datafusion/issues/7845#issuecomment-2370455772 and I'm not sure that predicate pushdown is going to be helpful here. The query is `SELECT count(*) where json_contains(...)`, which implies:

* Page index pushdown won't be effective, as `json_contains` cannot be evaluated against the page index statistics
* Late materialization, i.e. `RowFilter`, won't be advantageous, as there is nothing to late materialize

What follows are some avenues to explore.

**Potentially Incorrect DF ProjectionMask**

However, there is something fishy here. Given the query doesn't actually request any columns, the final `ProjectionMask` should be empty. As we have a single predicate, we would therefore expect the reader to never perform record skipping at all - it would evaluate the predicate to compute the `RowSelection`, and then just use `EmptyArrayReader` to materialize this (see the first sketch at the end of this comment). The trace, however, suggests DF is requesting columns in the final projection. I wonder if DF requests filter columns in the projection mask even when the filter that needs them has been pushed down? This is probably something that could/should be fixed.

**Adaptive Predicate Pushdown**

Currently every `RowFilter` and `RowSelection` provided is used to perform late materialization. However, the only scenario where this actually makes sense is when:

* The predicate is highly selective and the results are clustered
* The other columns in the projection are comparatively more expensive to materialize than those required to evaluate the predicate

The question then becomes what makes this judgement; currently I believe DF pushes down everything that it can. https://github.com/apache/arrow-rs/issues/5523 proposes adding some logic to the parquet reader to accept the pushed-down predicate but choose not to use it for late materialization (see the second sketch at the end of this comment). This would have the effect of making **pushing down a predicate no worse than not pushing it down**.

**Cache Decompressed Pages**

Currently, when evaluating a predicate, the decompressed pages are not kept around, even if those columns are to be used again, either in another predicate or in the final projection mask. Keeping them around would save CPU cycles at the cost of potentially significant additional memory usage, especially if the predicate is very selective.

**Cache Filtered Columns**

A potentially better option would be to retain the filtered columns for later use; however, aside from being quite complex to implement, this still runs the risk of blowing the memory budget.

**Lazy Predicate Evaluation**

The problem with both of the above is that predicates are completely evaluated up-front. Whilst this makes the code much simpler, especially in async contexts, it has the major drawback that any caching strategy has to potentially retain a huge amount of decoded data. If instead we evaluated the filters incrementally, we could yield batches as we go. This would also improve the performance of queries with limits that can't be fully pushed down.

I will try to find some time to work on this over the next few days.
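For reference, here is a minimal sketch of the empty-projection setup described above, using the parquet crate's synchronous `ParquetRecordBatchReaderBuilder`. The file name and the all-true placeholder predicate (standing in for `json_contains`) are assumptions for illustration; the point is only that the final `ProjectionMask` can be empty while a pushed-down `RowFilter` still drives row selection:

```rust
use std::fs::File;

use arrow::array::BooleanArray;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical input file standing in for the reproducer's data
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The predicate only needs the leaf column it filters on (index 0 here);
    // the all-true closure is a placeholder for the real json_contains evaluation
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        Ok(BooleanArray::from(vec![true; batch.num_rows()]))
    });

    // SELECT count(*) requests no output columns, so the final projection is empty
    let projection = ProjectionMask::leaves(builder.parquet_schema(), std::iter::empty::<usize>());

    let reader = builder
        .with_projection(projection)
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    // Batches carry no columns, only the row count of the selected rows
    let mut count = 0;
    for batch in reader {
        count += batch?.num_rows();
    }
    println!("matched rows: {count}");
    Ok(())
}
```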
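And a rough, hypothetical sketch of the kind of heuristic https://github.com/apache/arrow-rs/issues/5523 could add: accept the pushed-down predicate, but only use the resulting selection for late materialization when the surviving rows are both rare and clustered. The function and its thresholds are illustrative, not a proposed API; only `RowSelector` and its constructors are real:

```rust
use parquet::arrow::arrow_reader::RowSelector;

/// Hypothetical heuristic: is this RowSelection selective and clustered enough
/// that skipping records will be cheaper than just decoding everything?
fn worth_late_materializing(selectors: &[RowSelector]) -> bool {
    let total: usize = selectors.iter().map(|s| s.row_count).sum();
    let selected: usize = selectors.iter().filter(|s| !s.skip).map(|s| s.row_count).sum();
    // The number of select runs approximates how scattered the matching rows are
    let select_runs = selectors.iter().filter(|s| !s.skip).count();

    if total == 0 {
        return false;
    }
    let selectivity = selected as f64 / total as f64;
    let avg_run = selected as f64 / select_runs.max(1) as f64;

    // Only skip decode work when few rows survive *and* they form reasonably
    // long contiguous runs; dense random skips can cost more than they save.
    // The 10% / 32-row cut-offs are placeholders, not tuned values.
    selectivity < 0.1 && avg_run >= 32.0
}

fn main() {
    let selection = vec![
        RowSelector::skip(10_000),
        RowSelector::select(500),
        RowSelector::skip(50_000),
    ];
    println!("use late materialization: {}", worth_late_materializing(&selection));
}
```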
