Lordworms commented on PR #13054: URL: https://github.com/apache/datafusion/pull/13054#issuecomment-2475214828
I repushed with an adaptive way of generating the dynamic filter: when the number of unique values exceeds a threshold, it generates a range filter; otherwise it generates an InList filter. Because of https://github.com/apache/datafusion/issues/13298, I didn't push the filter down to the parquet scan; I pushed it down to the file stream instead. I also tried it with random data like

```python
import pandas as pd
import numpy as np

# Small build side: 1,000 rows whose keys all fall in [10, 20]
part = pd.DataFrame({
    'p_partkey': np.random.randint(10, 21, size=1000),
    'p_brand': np.random.choice(['Brand#1', 'Brand#2', 'Brand#3'], 1000),
    'p_container': np.random.choice(['SM BOX', 'LG BOX', 'MED BOX'], 1000)
})

# Large probe side: 100,000,000 rows with keys spread over [1, 1000000]
lineitem = pd.DataFrame({
    'l_partkey': np.random.randint(1, 1000001, size=100000000),
    'l_quantity': np.random.uniform(1, 50, size=100000000),
    'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
})
```

and the result is only about a 40% improvement. I think we could try pushing the filter down to the parquet scan once the filter path for reading parquet is optimized. Also, perhaps we need to improve the performance of InList, since when I tried 20 elements it was 5 times slower than using a range filter.
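To make the adaptive rule concrete, here is a minimal sketch in plain Python. It is not the DataFusion implementation: `build_dynamic_filter`, `InListFilter`, `RangeFilter`, and the `max_inlist_size` default are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Union


@dataclass(frozen=True)
class InListFilter:
    """Keeps a probe-side key only if it is one of the build-side keys."""
    values: FrozenSet[int]

    def matches(self, key: int) -> bool:
        return key in self.values


@dataclass(frozen=True)
class RangeFilter:
    """Keeps a probe-side key if it lies within [low, high] (may over-select)."""
    low: int
    high: int

    def matches(self, key: int) -> bool:
        return self.low <= key <= self.high


def build_dynamic_filter(
    build_keys: Iterable[int],
    max_inlist_size: int = 16,  # hypothetical threshold, not taken from the PR
) -> Union[InListFilter, RangeFilter]:
    """Pick a filter kind from the number of distinct build-side keys."""
    distinct = set(build_keys)
    if len(distinct) > max_inlist_size:
        # Too many distinct values: fall back to a cheap min/max range check.
        return RangeFilter(min(distinct), max(distinct))
    # Few distinct values: an exact membership test is affordable.
    return InListFilter(frozenset(distinct))


small = build_dynamic_filter([10, 12, 20])
large = build_dynamic_filter(range(10, 21), max_inlist_size=5)
```

In this sketch `small` is an exact membership filter, while `large` is a range filter that can let through keys not present on the build side, so the join itself still has to do the final matching.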