Lordworms commented on PR #13054:
URL: https://github.com/apache/datafusion/pull/13054#issuecomment-2475214828

   I repushed with an adaptive way to generate the dynamic filter: when the 
number of unique values exceeds a threshold, it generates a range filter; 
otherwise it generates an InList filter. Because of 
https://github.com/apache/datafusion/issues/13298 I didn't push the filter 
down to the parquet scan; I pushed it down to the file stream instead. I also 
tried it with random data like
   ```python
   import pandas as pd
   import numpy as np
   
   # Small table: 1000 rows, keys in [10, 20] (randint's upper bound is exclusive)
   part = pd.DataFrame({
       'p_partkey': np.random.randint(10, 21, size=1000),
       'p_brand': np.random.choice(['Brand#1', 'Brand#2', 'Brand#3'], 1000),
       'p_container': np.random.choice(['SM BOX', 'LG BOX', 'MED BOX'], 1000)
   })
   
   # Large table: 100 million rows, keys in [1, 1000000]
   lineitem = pd.DataFrame({
       'l_partkey': np.random.randint(1, 1000001, size=100000000),
       'l_quantity': np.random.uniform(1, 50, size=100000000),
       'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
   })
   ```
   
   and the result is 
   
![image](https://github.com/user-attachments/assets/d6243398-2d23-458f-b8f4-8efbcd956447)
   
   which is only around a 40% improvement. I think we could try pushing the 
filter down to the parquet scan if the filter for reading parquet is optimized.
   
   Also, perhaps we need to improve the performance of InList: when I tried it 
with 20 elements, it was 5 times slower than using the range filter.
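   To make the adaptive choice described at the top concrete, here is a minimal 
Python sketch of the idea. It is not the Rust code in this PR. The names 
`build_dynamic_filter`, `InListFilter`, `RangeFilter` and the cutoff 
`DISTINCT_THRESHOLD = 16` are made up for illustration, since the actual 
threshold is not stated here.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; not the value used in the PR.
DISTINCT_THRESHOLD = 16


@dataclass(frozen=True)
class InListFilter:
    """Exact membership test against the distinct join keys."""
    values: frozenset

    def matches(self, v):
        return v in self.values


@dataclass(frozen=True)
class RangeFilter:
    """Inclusive min/max bounds; may let through keys that are not join keys."""
    lo: int
    hi: int

    def matches(self, v):
        return self.lo <= v <= self.hi


def build_dynamic_filter(keys, threshold=DISTINCT_THRESHOLD):
    """Pick a filter kind from the number of distinct join keys."""
    distinct = set(keys)
    if len(distinct) > threshold:
        # Too many distinct values: fall back to a cheap min/max range.
        return RangeFilter(min(distinct), max(distinct))
    # Few distinct values (or none): keep the exact list.
    return InListFilter(frozenset(distinct))


# Usage: 3 distinct keys with threshold 2 yields a range filter.
f = build_dynamic_filter([10, 12, 20], threshold=2)
kept = [v for v in [5, 11, 12, 25] if f.matches(v)]
```

   In this sketch the range filter keeps 11 even though it is not a join key, 
so the join still has to do the exact match afterwards; the InList variant is 
exact but has to probe the list for every row.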
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

