alamb commented on issue #7363:
URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2792290735

   Thanks @zhuqi-lucas!
   
   Is there any way you could look into creating a benchmark for evaluating 
filters? I can do so too if you prefer.
   
   The idea is to create a benchmark for evaluating row filters (what 
@XiangpengHao is trying to optimize) that captures the common use case we 
are trying to optimize in DataFusion.
   
   You add a row filter with this API:
   
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_filter
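   
   For example, here is a minimal (untested) sketch of attaching a row filter 
and an output projection. The helper name, the `bytes::Bytes` input, and the 
leaf indices are illustrative assumptions, not anything settled in this issue:

```rust
use arrow::array::{BooleanArray, RecordBatch};
use arrow::error::ArrowError;
use bytes::Bytes;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

/// Read an in-memory parquet file back with a single row filter attached.
/// Leaf index 2 assumes the Utf8View column is third in the four-column
/// schema suggested below; `output_leaves` selects the output projection
/// (`None` = all columns).
fn read_with_filter<F>(
    data: Bytes,
    output_leaves: Option<Vec<usize>>,
    predicate: F,
) -> parquet::errors::Result<usize>
where
    F: FnMut(RecordBatch) -> Result<BooleanArray, ArrowError> + Send + 'static,
{
    let mut builder = ParquetRecordBatchReaderBuilder::try_new(data)?;

    // The predicate is evaluated against batches containing only the
    // column(s) selected by this mask
    let predicate_mask = ProjectionMask::leaves(builder.parquet_schema(), [2]);
    let filter = RowFilter::new(vec![Box::new(ArrowPredicateFn::new(
        predicate_mask,
        predicate,
    ))]);

    if let Some(leaves) = output_leaves {
        let projection = ProjectionMask::leaves(builder.parquet_schema(), leaves);
        builder = builder.with_projection(projection);
    }

    // Decoding (and thus filter evaluation) happens as the reader is
    // drained; this loop is what the benchmark would time
    let reader = builder.with_row_filter(filter).build()?;
    let mut rows = 0;
    for batch in reader {
        rows += batch?.num_rows();
    }
    Ok(rows)
}
```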
   
   I suggest a benchmark that:
   1. Writes a parquet file with 100K rows and four columns (int64, float64, 
Utf8View, and Timestamp) into memory (a sketch follows this list)
   2. Adds filters + projections
   3. Benchmarks how fast it is to read the data back
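   
   A sketch of step 1 under the same assumptions, writing a single batch 
through `ArrowWriter` into a `Vec<u8>`. Column names and value distributions 
are placeholders, chosen so the filters discussed below have something to 
select:

```rust
use std::sync::Arc;

use arrow::array::{
    ArrayRef, Float64Array, Int64Array, RecordBatch, StringViewArray,
    TimestampNanosecondArray,
};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;

/// Build a batch of 100K rows and write it to an in-memory buffer
fn write_test_file() -> parquet::errors::Result<Bytes> {
    const NUM_ROWS: usize = 100_000;
    let ints = Int64Array::from_iter_values(0..NUM_ROWS as i64);
    let floats = Float64Array::from_iter_values((0..NUM_ROWS).map(|i| i as f64));
    let strings = StringViewArray::from_iter_values((0..NUM_ROWS).map(|i| {
        if i % 10_000 == 0 {
            "rare" // a handful of rows, for the selective filter
        } else if i % 2 == 0 {
            "value"
        } else {
            "" // roughly half the rows, for the non-selective filter
        }
    }));
    let timestamps = TimestampNanosecondArray::from_iter_values(0..NUM_ROWS as i64);

    let batch = RecordBatch::try_from_iter([
        ("i64", Arc::new(ints) as ArrayRef),
        ("f64", Arc::new(floats) as ArrayRef),
        ("utf8view", Arc::new(strings) as ArrayRef),
        ("ts", Arc::new(timestamps) as ArrayRef),
    ])?;

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(buf.into())
}
```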
   
   For the filters, it is important to capture both selective filters (that 
select a small number of contiguous ranges) and non-selective filters (that 
select rows scattered throughout the data). Here are some suggestions, with 
a sketch of the string predicates after the list.
   
   Filters:
   1. A string filter like `col <> ''` that selects about 1/2 of the data
   2. A string filter like `col = 'const'` that is selective and selects only 
a few rows
   3. An integer filter like `col = 5` (both a selective and a non-selective 
variant)
   4. A timestamp filter like `ts > time`
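   
   For illustration, the two string filters might look like this as predicate 
closures built on arrow's `cmp` kernels, reusing the `write_test_file` / 
`read_with_filter` sketches above (the `'rare'` constant is made up to match 
that generator):

```rust
use arrow::array::{RecordBatch, Scalar, StringViewArray};
use arrow::compute::kernels::cmp::{eq, neq};

/// Inside an `ArrowPredicateFn` the batch contains only the masked
/// column(s), so column 0 here is the Utf8View column
fn run_string_filters() -> parquet::errors::Result<()> {
    let data = write_test_file()?;

    // 1. Non-selective: `col <> ''` keeps roughly half the rows
    let rows = read_with_filter(data.clone(), None, |batch: RecordBatch| {
        neq(
            batch.column(0),
            &Scalar::new(StringViewArray::from_iter_values([""])),
        )
    })?;
    println!("col <> '' selected {rows} rows");

    // 2. Selective: `col = 'rare'` keeps only a handful of rows, given
    // the value distribution in `write_test_file`
    let rows = read_with_filter(data, None, |batch: RecordBatch| {
        eq(
            batch.column(0),
            &Scalar::new(StringViewArray::from_iter_values(["rare"])),
        )
    })?;
    println!("col = 'rare' selected {rows} rows");
    Ok(())
}
```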
   
   For the projections, it is important to capture both the case where the 
column used in the predicate appears in the output and the case where it does 
not. Here are suggestions; a sketch of a benchmark harness follows the list.
   
   Projections (which columns appear in the output):
   1. All 4 columns
   2. Some column other than the filter column
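   
   To tie it together, a hypothetical `criterion` harness (the existing 
parquet benchmarks use criterion). Each filter/projection combination would 
get its own `bench_function`; the two projection cases map to `None` (all 
four columns) and, say, `Some(vec![1])` (only the float column) in the 
sketch above:

```rust
use arrow::array::{RecordBatch, Scalar, StringViewArray};
use arrow::compute::kernels::cmp::neq;
use bytes::Bytes;
use criterion::{criterion_group, criterion_main, Criterion};

/// Relies on `write_test_file` and `read_with_filter` from the sketches
/// above; only the `col <> ''` / all-four-columns case is spelled out,
/// the other filter/projection combinations would follow the same shape
fn row_filter_benches(c: &mut Criterion) {
    let data: Bytes = write_test_file().expect("write in-memory parquet file");

    // Projection case 1: all four columns (`None`); case 2 would pass
    // `Some(vec![1])` to read back only the float column
    c.bench_function("utf8view <> '' / all 4 columns", |b| {
        b.iter(|| {
            read_with_filter(data.clone(), None, |batch: RecordBatch| {
                neq(
                    batch.column(0),
                    &Scalar::new(StringViewArray::from_iter_values([""])),
                )
            })
            .unwrap()
        })
    });
}

criterion_group!(benches, row_filter_benches);
criterion_main!(benches);
```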
   
   

