alamb commented on issue #7363: URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2792290735
Thanks @zhuqi-lucas! Is there any way you could look into creating a benchmark for evaluating filters? I can do so too if you prefer.

The idea is to create a benchmark for evaluating row filters (what @XiangpengHao is trying to optimize) that captures the common use case and is what we are trying to optimize in DataFusion. You add a row filter with this API: https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter

I suggest a benchmark that:
1. Writes a parquet file with 100K rows and four columns (int64, float64, Utf8View, and Timestamp) into memory
2. Adds filters + projections
3. Benchmarks how fast it is to read the data back

For the filters, it is important to capture both selective filters (that select a small number of contiguous ranges) as well as non-selective filters (that select rows that are scattered throughout the data). Here are suggestions.

Filters:
1. A string filter like `col <> ''` that selects about 1/2 of the data
2. A string filter like `col = 'const'` that is selective and selects only a few rows
3. An integer filter like `col = value` (with both a selective and a non-selective variant)
4. A timestamp filter like `ts > time`

For the projections, it is important to capture both the case where the predicate column appears in the output and the case where only non-predicate columns appear. Here are suggestions.

Projections (which columns are selected out):
1. All 4 columns
2. Some column other than the filter column
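To make the shape of the benchmark concrete, here is a minimal sketch of the setup using the `with_row_filter` API linked above, along with `ArrowWriter`, `ParquetRecordBatchReaderBuilder`, `ArrowPredicateFn`, and `RowFilter` from the `parquet` crate (plus the `arrow` and `bytes` crates). It is not the benchmark itself: it uses only two columns for brevity, the column names `int_col` / `str_col` are illustrative placeholders, and a real benchmark would wrap the final read loop in criterion rather than running it once.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, Scalar, StringArray};
use arrow::compute::kernels::cmp::neq;
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::{ArrowWriter, ProjectionMask};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Write a parquet file with 100K rows into an in-memory buffer
    //    (two columns here for brevity; the benchmark would use four).
    let int_col: ArrayRef = Arc::new(Int64Array::from_iter_values(0..100_000));
    let str_col: ArrayRef = Arc::new(StringArray::from_iter_values(
        (0..100_000).map(|i| if i % 2 == 0 { "" } else { "value" }),
    ));
    let batch =
        RecordBatch::try_from_iter([("int_col", int_col), ("str_col", str_col)])?;

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    // 2. Add a filter (`str_col <> ''`, selecting about 1/2 of the rows)
    //    plus a projection that returns only the *other* column.
    let builder = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))?;

    // The predicate only needs to decode `str_col` (leaf column 1)
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [1]);
    let predicate = ArrowPredicateFn::new(predicate_mask, |batch| {
        neq(
            batch.column(0),
            &Scalar::new(StringArray::from_iter_values([""])),
        )
    });

    // The output projects only `int_col` (leaf column 0), i.e. the
    // "some column other than the filter column" projection case
    let output_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .with_projection(output_mask)
        .build()?;

    // 3. This read loop is the part the benchmark would time
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    assert_eq!(rows, 50_000);
    Ok(())
}
```

The other filter/projection combinations suggested above would follow the same pattern, varying only the predicate closure and the two `ProjectionMask`s.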